Abstract

To be practical, recognition systems must deal with uncertainty. Positions of image
features in scenes vary. Features sometimes fail to appear because of unfavorable
illumination. In this work, methods of statistical inference are combined with empirical
models of uncertainty in order to evaluate and refine hypotheses about the occurrence
of a known object in a scene.

Probabilistic models are used to characterize image features and their
correspondences. A statistical approach is taken for the acquisition of object models
from observations in images: Mean Edge Images are used to capture object features that
are reasonably stable with respect to variations in illumination.

The Alignment approach to recognition, described by Huttenlocher and Ullman, is
used. The mechanisms that are employed to generate initial hypotheses are distinct
from those that are used to verify (and refine) them. In this work, posterior
probability and Maximum Likelihood are the criteria for evaluating and refining
hypotheses. The recognition strategy advocated in this work may be summarized as
Align Refine Verify, whereby local search in pose space is utilized to refine
hypotheses from the alignment stage before verification is carried out.

Two formulations of model-based object recognition are described. MAP Model
Matching evaluates joint hypotheses of match and pose, while Posterior Marginal
Pose Estimation evaluates the pose only. Local search in pose space is carried out
with the Expectation-Maximization (EM) algorithm.

Recognition experiments are described where the EM algorithm is used to refine
and evaluate pose hypotheses in 2D and 3D. Initial hypotheses for the 2D experiments
were generated by a simple indexing method: Angle Pair Indexing. The Linear
Combination of Views method of Ullman and Basri is employed as the projection
model in the 3D experiments.
Chapter 1


Introduction 


Visual object recognition is the focus of the research reported in this thesis.
Recognition must deal with uncertainty to be practical. Positions of image features
belonging to objects in scenes vary. Features sometimes fail to appear because of
unfavorable illumination. In this work, methods of statistical inference are combined
with empirical models of uncertainty in order to evaluate hypotheses about the
occurrence of a known object in a scene. Other problems, such as the generation of
initial hypotheses and the acquisition of object model features, are also addressed.


1.1  The  Problem 

Representative  recognition  problems  and  their  solutions  are  illustrated  in  Figures  1-1 
and  1-2.  The  problem  is  to  detect  and  locate  the  car  in  digitized  video  images,  using 
previously  available  detailed  information  about  the  car.  In  these  figures,  object  model 
features are superimposed over the video images at the position and orientation where
the  car  was  found.  Figure  1-1  shows  the  results  of  2D  recognition,  while  Figure  1-2 
illustrates  the  results  of  3D  recognition.  These  images  are  from  experiments  that  are 
described  in  Chapter  10.  Practical  solutions  to  problems  like  these  will  improve  the 
flexibility  of  robotic  systems. 




In this work, the recognition problem is restricted to finding occurrences of a single
object in scenes that may contain other unknown objects. Despite the simplification
and years of research, the problem remains largely unsolved. Robust systems that
can recognize smooth objects having six degrees of freedom of position, under varying
conditions of illumination, occlusion, and background, are not commercially available.
Much effort has been expended on this problem, as is evident in the comprehensive
reviews of research in computer-based object recognition by Besl and Jain [5], who
cited  20d  references,  and  Chin  and  Dyer  [18],  who  cited  15')  references.  The  goal  of 
this  thesis  is  to  characterize,  as  well  as  to  describe  how  to  find,  robust  solutions  to 
visual  object  recognition  problems. 

1.2  The  Approach 

In  this  work,  statistical  methods  are  used  to  evaluate  and  refine  hypotheses  in  object 
recognition.  Angle  Pair  Indexing,  a  means  of  generating  hypotheses,  is  introduced. 
These  mechanisms  are  used  in  an  extension  of  the  Alignment  method  that  includes  a 
pose refinement step. Each of these components is amplified below.

1.2.1  Statistical  Approach 

In this research, visual object recognition is approached via the principles of Maximum
Likelihood (ML) and Maximum A-Posteriori probability (MAP). These principles,
along with specific probabilistic models of aspects of object recognition, are used to
derive objective functions for evaluating and refining recognition hypotheses. The ML
and MAP criteria have a long history of successful application in formulating decisions
and in making estimates from observed data. They have attractive properties of
optimality and are often useful when measurement errors are significant.

In other areas of computer vision, statistics has proven useful as a theoretical
framework. The work of Yuille, Geiger and Bülthoff on stereo [78] is one example,
while in image restoration the work of Geman and Geman [28], Marroquin [54], and
Marroquin, Mitter and Poggio [55] are others. The statistical approach that is used
in this thesis converts the recognition problem into a well defined (although not
necessarily easy) optimization problem. This has the advantage of providing an explicit
characterization of the problem, while separating it from the description of the
algorithms used to solve it. Ad hoc objective functions have been profitably used in some
areas of computer vision. Such an approach is used by Barnard in stereo matching
[2], Blake and Zisserman [7] in image restoration and Beveridge, Weiss and Riseman
[6] in line segment based model matching. With this approach, plausible forms for
components of the objective function are often combined using trade-off parameters.
Such trade-off parameters are determined empirically. An advantage of deriving
objective functions from statistical theories is that assumptions become explicit: the
forms of the objective function components are clearly related to specific probabilistic
models. If these models fit the domain then there is some assurance that the resulting
criteria will perform well. A second advantage is that the trade-off parameters in the
objective function can be derived from measurable statistics of the domain.

1.2.2  Feature-Based  Recognition 

This work uses a feature-based approach to object recognition. Features are
abstractions like points or curves that summarize some structure of the patterns in an image.
There are several reasons for using feature-based approaches to object recognition.

•  Features  can  concisely  represent  objects  and  images.  Features  derived  from 
brightness  edges  can  summarize  the  important  events  of  an  image  in  a  way  that 
is  reasonably  stable  with  respect  to  scene  illumination. 

•  In the alignment approach to recognition (to be described shortly), hypotheses
are verified by projecting the object model into the image, then comparing the
prediction against the image. By using compact, feature-based representations
of the object, projection costs may be kept low.

•  Features  also  facilitate  hypothesis  generation.  Indexing  methods  are  attractive 
mechanisms  for  hypothesis  generation.  Such  methods  use  tables  indexed  by 
properties  of  small  groups  of  image  features  to  quickly  locate  corresponding 
model  features. 

Object  Features  from  Observation 

A major issue that must be faced in model-based object recognition concerns the
origin of the object model itself. The object features that are used in this work are
derived from actual image observations. This method of feature acquisition
automatically favors those features that are likely to be detected in images. The potentially
difficult problem of predicting image features from abstract geometric models is
bypassed. This prediction problem is manageable in some constrained domains (with
polyhedral objects, for instance) but it is often difficult, especially with smooth
objects, low resolution images and lighting variations.

For robustness, simple local image features are used in this work. Features of this
sort are easily detected, in contrast to extended features like line segments. Extended
features have been used in some systems for hypothesis generation because their
additional structure provides more constraint than that offered by simple local features.
Extended features, nonetheless, have drawbacks in being difficult to detect due to
occlusions and localized failures of image contrast. Because of this, systems that rely
on distinguished features can lose robustness.

1.2.3  Alignment 

Hypothesize-and-test, or alignment, methods have proven effective in visual object
recognition. Huttenlocher and Ullman [4’.f] used search over minimal sets of
corresponding features to establish candidate hypotheses. In their work these hypotheses,
or Alignments, are tested by projecting the object model into the image using the
pose (position and orientation) implied by the hypothesis, and then by performing a
detailed comparison with the image. The basic strategy of the alignment method is
to use separate mechanisms for generating and testing hypotheses.

Recently, indexing methods have become available for efficiently generating
hypotheses in recognition. These methods avoid a significant amount of search by using
pre-computed tables for looking up the object features that might correspond to a
group of image features. The geometric hashing method of Lamdan and Wolfson [49]
uses invariant properties of small groups of features under affine transformations as
the look-up key. Clemens and Jacobs [19] [20], and Jacobs [45] described indexing
methods that gain efficiency by using a feature grouping process to select small sets
of image features that are likely to belong to one object in the scene.

In  this  work,  a  simple  form  of  2D  indexing,  Angle  Pair  Indexing,  is  used  to  generate 
initial  hypotheses.  It  uses  an  invariant  property  of  pairs  of  image  features  under 
translation,  rotation  and  scale.  This  is  described  in  Chapter  9. 

The  Hough  transform  [40]  [44]  is  another  commonly  used  method  for  generating 
hypotheses  in  object  recognition.  In  the  Hough  method,  feature-based  clustering  is 
performed  in  pose  space,  the  space  of  the  transformations  describing  the  possible 
motion of the object. This method was used by Grimson and Lozano-Perez [36] to
localize  the  search  in  recognition. 

These  fast  methods  of  hypothesis  generation  provide  ongoing  reasons  for  using  the 
alignment  approach.  They  are  often  most  effective  when  used  in  conjunction  with 
verification.  Verification  is  important  because  indexing  methods  can  be  susceptible 
to table collisions, while Hough methods sometimes generate false positives due to
their aggregation of inconsistent evidence in pose space bins. This last point has been
argued by Grimson and Huttenlocher [35].

The usual alignment strategy may be summarized as align verify. Alignment and
verification place differing pressures on the choice of features for recognition.
Mechanisms used for generating hypotheses typically have computational complexity that
is polynomial in the number of features involved. Because of this, there is significant
advantage to using low resolution features: there are fewer of them. Unfortunately,
pose estimates based on coarse features tend to be less accurate than those based on
high resolution features.

Likewise, verification is usually more reliable with high resolution features. This
approach yields more detailed comparisons. These differing pressures may be
accommodated by employing coarse-fine approaches. The coarse-fine strategy was utilized
successfully in stereo by Grimson [33]. In the coarse-fine strategy, hypotheses
derived from low-resolution features limit the search for hypotheses derived from
high-resolution features. There are some potential difficulties that arise when applying
coarse-fine methods in conjunction with 3D object models. These may be avoided
by using view-based alternatives to 3D object modeling. These issues are discussed
more fully in Chapter 4.


Align  Refine  Verify 

The recognition strategy advocated in this work may be summarized as align refine
verify. This approach has been used by Lipson [50] in refining alignments. The key
observation  is  that  local  search  in  pose  space  may  be  used  to  refine  the  hypothesis 
from  the  alignment  stage  before  verification  is  carried  out.  In  hypothesize  and  test 
methods,  the  pose  estimates  of  the  initial  hypotheses  tend  to  be  somewhat  inaccurate, 
since  they  are  based  on  minimal  sets  of  corresponding  features.  Better  pose  estimates 
(hence,  better  verifications)  are  likely  to  result  from  using  all  supporting  image  feature 
data,  rather  than  a  small  subset.  Chapter  8  describes  a  method  that  refines  the  pose 
estimate  while  simultaneously  identifying  and  incorporating  the  constraints  of  all 
supporting  image  features. 
1.3 Guide to Thesis

Briefly, the presentation of the material in this thesis is essentially bottom-up. The
early chapters are concerned with building the components of the formulation, while
the main contributions, the statistical formulations of object recognition, are
described in Chapters 6 and 7. After that, related algorithms are described, followed
by experiments and conclusions.

In more detail, Chapter 2 describes the probabilistic models of the
correspondences, or mapping between image features and features belonging to either the
object or to the background. These models use the principle of maximum entropy where
little information is available before the image is observed. In Chapter 3, probabilistic
models are developed that characterize the feature detection process. Empirical
evidence is described to support the choice of model.

Chapter 4 discusses a way of obtaining average object edge features from a
sequence of observations of the object in images. Deterministic models of the projection
of features into the image are discussed in Chapter 5. The projection methods used
in this work are linear in the parameters of the transformations. Methods for 2D and
3D are discussed, including the Linear Combination of Views method of Ullman and
Basri [71].

In Chapter 6 the above models are combined in a Bayesian framework to construct
a criterion, MAP Model Matching, for evaluating hypotheses in object recognition.
In this formulation, complete hypotheses consist of a description of the
correspondences between image and object features, as well as the pose of the object. These
hypotheses are evaluated by their posterior (after the image is observed) probability.
A recognition experiment is described that uses the criterion to guide a heuristic search
over correspondences. A connection between MAP Model Matching and a method of
robust chamfer matching [47] is described.

Building on the above, a second criterion is described in Chapter 7: Posterior
Marginal Pose Estimation (PMPE). Here, the solution being sought is simply the
pose of the object. The posterior probability of poses is obtained by taking the
formal marginal, over all possible matches, of the posterior probability of the joint
hypotheses of MAP Model Matching. This results in a smooth, non-linear objective
function for evaluating poses. The smoothness of the objective function facilitates
local search in pose space as a mechanism for refining hypotheses in recognition.
Some experimental explorations of the objective function in pose space are described.
These characterizations are carried out in two domains: video imagery and synthetic
radar range imagery.

Chapter 8 describes use of the Expectation-Maximization (EM) algorithm [21]
for finding local maxima of the PMPE objective function. This algorithm alternates
between the M step, a weighted least squares pose estimate, and the E step, a
recalculation of the weights based on a saturating non-linear function of the residuals.
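
The alternation between the two steps can be illustrated with a minimal Python sketch. It is not the formulation of Chapter 8. It assumes a translation-only 2D pose, isotropic Gaussian residuals with a hypothetical standard deviation `sigma`, and a hypothetical constant `background` in the denominator of the weights, which is what makes them saturate. The function name and all parameter names are illustrative assumptions.

```python
import math

def refine_translation(model, image, t0, sigma=2.0, background=1e-3, iters=20):
    """Refine a 2D translation by alternating weight and pose updates.

    model, image: lists of (x, y) points. t0: initial translation guess.
    sigma and background are assumed values, not taken from the thesis.
    """
    tx, ty = t0
    for _ in range(iters):
        sum_w = sum_x = sum_y = 0.0
        for (yx, yy) in image:
            # E step: weights from a saturating function of the residuals,
            # computed with the translation from the previous iteration.
            g = []
            for (mx, my) in model:
                rx, ry = yx - (mx + tx), yy - (my + ty)
                g.append(math.exp(-0.5 * (rx * rx + ry * ry) / sigma ** 2))
            denom = background + sum(g)
            # Accumulate the sums needed by the weighted least squares update.
            for (mx, my), g_ij in zip(model, g):
                w = g_ij / denom
                sum_w += w
                sum_x += w * (yx - mx)
                sum_y += w * (yy - my)
        if sum_w < 1e-12:
            break  # no image feature supports the current pose
        # M step: weighted least squares estimate of the translation.
        tx, ty = sum_x / sum_w, sum_y / sum_w
    return tx, ty

model = [(0.0, 0.0), (20.0, 0.0), (0.0, 20.0), (20.0, 20.0)]
image = [(x + 3.0, y - 2.0) for x, y in model] + [(200.0, 200.0)]  # last point is clutter
estimate = refine_translation(model, image, (2.0, -1.0))
```

Because the weights fall to zero for large residuals, the clutter point far from every predicted feature contributes nothing to the pose update in this sketch.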

This algorithm is used to refine and evaluate poses in 2D and 3D recognition
experiments that are described in Chapter 10. Initial hypotheses for the 2D experiments
were generated by a simple indexing method, Angle Pair Indexing, that is described
in Chapter 9. The Linear Combination of Views method of Ullman and Basri [71] is
employed as the projection model in the 3D experiments reported there.

Finally, some conclusions are drawn in Chapter 11. The notation used throughout
is  summarized  in  Appendix  A. 

Chapter 2


Modeling  Feature  Correspondence 


This chapter is concerned with probabilistic models of feature correspondences. These
models will serve as priors in the statistical theories of object recognition that are
described in Chapters 6 and 7, and are important components of those formulations.
They are used to assess the probability that features correspond before the image data
is compared to the object model. They capture the expectation that some features
in an image are due to the object.

Three  different  models  of  feature  correspondence  are  described,  one  of  which  is 
used  in  the  recognition  experiments  described  in  Chapters  6,  7,  and  10. 

2.1  Features  and  Correspondences 

This research focuses on feature-based object recognition. The object being sought
and  the  image  being  analyzed  consist  of  discrete  features. 

Let the image that is to be analyzed be represented by a set of v-dimensional
point features

$$ Y = \{ Y_1, Y_2, \ldots, Y_n \} , \quad Y_i \in \mathbb{R}^v . $$

Image features are discussed in more detail in Chapters 3 and 5.

The object to be recognized is also described by a set of features,

$$ M = \{ M_1, M_2, \ldots, M_m \} . $$

The features will usually be represented by real matrices. Additional details on object
features appear in Chapters 4 and 5.

In this work, the interpretation of the features in an image is represented by the
variable $\Gamma$, which describes the mapping from image features to object features or the
scene background. This is also referred to as the correspondences.

$$ \Gamma = \{ \Gamma_1, \Gamma_2, \ldots, \Gamma_n \} , \quad \Gamma_i \in M \cup \{ \perp \} . $$

In an interpretation, each image feature, $Y_i$, will be assigned either to some object
feature $M_j$, or to the background, which is denoted by the symbol $\perp$. This symbol
plays a role similar to that of the null character in the interpretation trees of Grimson
and Lozano-Perez [36]. An interpretation is illustrated in Figure 2-1. $\Gamma$ is a collection
of variables that is indexed in parallel with the image features. Each variable $\Gamma_i$
represents the assignment of the corresponding image feature $Y_i$. It may take on as
value any of the object features $M_j$, or the background, $\perp$. Thus, the meaning of the
expression $\Gamma_5 = M_6$ is that image feature five is assigned to object feature six; likewise
$\Gamma_7 = \perp$ means that image feature seven has been assigned to the background. In an
interpretation each image feature is assigned, while some object features may not be.
Additionally, several image features may be assigned to the same object feature. This
representation allows image interpretations that are implausible; other mechanisms
are used to encourage metrical consistency.
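
The representation described above can be sketched in a few lines of Python. The encoding is a hypothetical one chosen for illustration: object features are indexed from 0, the background symbol is encoded as `None`, and the helper name `summarize` is an assumption.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: entry i holds the index of the object feature that
# image feature i is assigned to, or None for the background.
Interpretation = List[Optional[int]]

def summarize(interp: Interpretation,
              num_object_features: int) -> Tuple[List[int], List[int], List[int]]:
    """Return (image features on the background,
               object features with no image feature,
               object features shared by several image features)."""
    background = [i for i, g in enumerate(interp) if g is None]
    counts = [0] * num_object_features
    for g in interp:
        if g is not None:
            counts[g] += 1
    unmatched = [j for j, c in enumerate(counts) if c == 0]
    shared = [j for j, c in enumerate(counts) if c > 1]
    return background, unmatched, shared

# Five image features and four object features: every image feature is
# assigned, object features 1 and 3 are unused, object feature 2 is shared.
example = [0, 2, None, 2, None]
result = summarize(example, 4)
```

Nothing in this encoding prevents implausible interpretations, which matches the remark that consistency is encouraged by other mechanisms.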
2.2 An Independent Correspondence Model


In this section a simple probabilistic model of correspondences is described. The
intent is to capture some information bearing on correspondences before the image is
compared to the object. This model has been designed to be a reasonable compromise
between simplicity and accuracy.

In this model, the correspondence statuses of differing image features are assumed
to be independent, so that

$$ p(\Gamma) = \prod_i p(\Gamma_i) . \qquad (2.1) $$

Here, $p(\Gamma)$ is a probability mass function on the discrete variable $\Gamma$. There is
evidence against using statistical independence here; for example, occlusion is locally
correlated. Independence is used as an engineering approximation that simplifies the
resulting formulations of recognition. It may be justified by the good performance
of the recognition experiments described in Chapters 6, 7, and 10. Few recognition
systems have used non-independent models of correspondence. Breuel outlined one
approach in his thesis [9]. A relaxation of this assumption is discussed in the following
section.

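The component probability function is designed to characterize the amount of
clutter in the image, but to be otherwise as non-committal as possible:

$$ p(\Gamma_i) = \begin{cases} B & \text{if } \Gamma_i = \perp \\ \frac{1 - B}{m} & \text{otherwise} \end{cases} \qquad (2.2) $$

where $m$ is the number of object features, so that the remaining probability $1 - B$ is
shared equally among them.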


The joint model $p(\Gamma)$ is the maximum entropy probability function that is
consistent with the constraint that the probability of an image feature belonging to the
background is $B$. $B$ may be estimated by taking simple statistics on images from the
domain. $B = 0.9$ would mean that 90% of image features are expected to be due to
the background.
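
As an illustration, the independent model can be evaluated as follows. This is a sketch that assumes the maximum entropy reading of the component function: probability B for the background, and the remaining 1 - B spread uniformly over the m object features. The function name and the `None` encoding of the background are hypothetical.

```python
import math
from typing import Optional, Sequence

def log_prior_independent(interp: Sequence[Optional[int]],
                          background_prob: float,
                          num_object_features: int) -> float:
    """Log of p(Gamma) under independence; None encodes the background.

    Assumes 0 < background_prob < 1 and num_object_features >= 1.
    """
    log_bg = math.log(background_prob)
    # Assumed uniform share of the non-background mass per object feature.
    log_match = math.log((1.0 - background_prob) / num_object_features)
    return sum(log_bg if g is None else log_match for g in interp)

# B = 0.9, ten object features, two background features and one match:
# the prior is 0.9 * 0.9 * (0.1 / 10).
value = math.exp(log_prior_independent([None, None, 3], 0.9, 10))
```

Working with logarithms avoids underflow when the number of image features is large, which is a design choice of the sketch rather than something stated in the text.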

Having $B$ constant during recognition is an approximation. The number of
features due to the object will likely vary according to the size of the object in the scene.
$B$ could be estimated at recognition time by pre-processing mechanisms that evaluate
image clutter, and factor in expectations about the size of the object. In practice,
the approximation works well in controlled situations.

The  independent  correspondence  model  is  used  in  the  experiments  reported  in 
this  research. 

2.3  A  Markov  Correspondence  Model 

As indicated above, one inaccuracy of the independent correspondence model is that
sample realizations of $\Gamma$ drawn from the probability function of Equations 2.1 and
2.2 will tend to be overly fragmented in their modeling of occlusion. This section
describes a compromise model that relaxes the independence assumption somewhat
by allowing the correspondence status of an image feature ($\Gamma_i$) to depend on that of
its neighbors. In the domain of this research, image features are fragments of image
edge curves. These features have a natural neighbor relation, adjacency along the
image edge curve, that may be used for constructing a 1D Markov Random Field
(MRF) model of correspondences. MRFs are collections of random variables whose
conditional dependence is restricted to limited size neighborhoods. MRF models are
discussed by Geman and Geman [28]. The following describes an MRF model of
correspondences intended to provide a more accurate model of occlusion.


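$$ p(\Gamma) = q(\Gamma_1) q(\Gamma_2) \cdots q(\Gamma_n) \, r_1(\Gamma_1, \Gamma_2) \, r_2(\Gamma_2, \Gamma_3) \cdots r_{n-1}(\Gamma_{n-1}, \Gamma_n) \qquad (2.3) $$

where

$$ q(\Gamma_i) = \begin{cases} e_1 & \text{if } \Gamma_i = \perp \\ e_2 & \text{otherwise} \end{cases} \qquad (2.4) $$

and

$$ r_i(a, b) = \begin{cases} \begin{cases} e_3 & \text{if } a = \perp \text{ and } b = \perp \\ e_4 & \text{if } a \neq \perp \text{ and } b \neq \perp \\ e_5 & \text{otherwise} \end{cases} & \text{if features } i \text{ and } i + 1 \text{ are neighbors} \\ 1 & \text{otherwise} . \end{cases} \qquad (2.5) $$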


The assignment of indices to image features should be done in such a way that
neighboring features have adjacent indices. The functions $r_i$ model the interaction
of neighboring features. The parameters $e_1 \ldots e_5$ may be adjusted so that the
probability function $p(\Gamma)$ is consistent with observed statistics on clutter and
frequency of adjacent occlusions. Additionally, the parameters must be constrained so
that Equation 2.3 actually describes a probability function. When these constraints
are met, the model will be the maximum entropy probability function consistent with
the constraints. Satisfying the constraints is a non-trivial selection problem that may
be approached iteratively. Fortunately, this calculation doesn't need to be carried out
at recognition time. Goldman [30] discusses methods of calculating these parameters.

The  model  outlined  in  Equations  2.3  -  2.5  is  a  generalization  of  the  Ising  spin 
model.  Ising  models  are  used  in  statistical  physics  to  model  ferromagnetism  [73]. 
Samples  drawn  from  Ising  models  exhibit  spatial  clumping  whose  scale  depends  on 
the  parameters.  In  object  recognition,  this  clumping  behavior  may  provide  a  more 
accurate  model  of  occlusion. 
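
The product form of Equations 2.3 to 2.5 can be sketched in Python as follows. The sketch assumes that the pairwise factor is 1 for features that are not neighbors, and it makes no attempt to choose parameters that normalize the function. The name `markov_prior`, the `None` encoding of the background, and the `neighbors` flag list are hypothetical.

```python
from typing import Optional, Sequence

def markov_prior(interp: Sequence[Optional[int]],
                 e: Sequence[float],
                 neighbors: Sequence[bool]) -> float:
    """Evaluate the product of single-feature and pairwise factors.

    interp: assignments, None for the background.
    e: the five parameters (e1, e2, e3, e4, e5).
    neighbors[i]: True if image features i and i + 1 are adjacent on a curve.
    """
    e1, e2, e3, e4, e5 = e
    p = 1.0
    for g in interp:
        p *= e1 if g is None else e2          # single-feature factor
    for i in range(len(interp) - 1):
        if not neighbors[i]:
            continue                          # assumed factor of 1: no interaction
        a_bg = interp[i] is None
        b_bg = interp[i + 1] is None
        if a_bg and b_bg:
            p *= e3                           # both on the background
        elif not a_bg and not b_bg:
            p *= e4                           # both on the object
        else:
            p *= e5                           # status changes between neighbors
    return p

# With e3 = e4 = e5 = 1 the pairwise factors vanish and the model reduces to
# independent factors e1 (background) and e2 (any one object feature).
reduced = markov_prior([None, 1, None], (0.9, 0.01, 1.0, 1.0, 1.0), [True, True])
```

Choosing e3 and e4 larger than e5 favors runs of features with the same status, which is the clumping behavior described above.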


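The standard Ising model is shown for reference in the following equations. It has
been restricted to 1D, and has been adapted to the notation of this section.

$$ \sigma_i \in \{ -1, 1 \} $$

$$ P(\sigma_1 \sigma_2 \ldots \sigma_n) = \frac{1}{Z} \, q(\sigma_1) q(\sigma_2) \cdots q(\sigma_n) \, r(\sigma_1, \sigma_2) \, r(\sigma_2, \sigma_3) \cdots r(\sigma_{n-1}, \sigma_n) $$

$$ q(a) = \begin{cases} \exp\left(\frac{\mu H}{kT}\right) & \text{if } a = 1 \\ \exp\left(-\frac{\mu H}{kT}\right) & \text{otherwise} \end{cases} $$

$$ r(a, b) = \begin{cases} \exp\left(\frac{J}{kT}\right) & \text{if } a = b \\ \exp\left(-\frac{J}{kT}\right) & \text{otherwise} . \end{cases} $$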


Here, $Z$ is a normalization constant, $\mu$ is the moment of the magnetic dipoles,
$H$ is the strength of the applied magnetic field, $k$ is Boltzmann's constant, $T$ is
temperature, and $J$ is a neighbor interaction constant called the exchange energy.
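
The spatial clumping of Ising samples can be seen in a short simulation. The sketch below relies on simplifying assumptions that are not part of the text: zero applied field (H = 0) and a chain with free ends. Under those assumptions the 1D chain can be sampled exactly by copying the previous spin with probability exp(K) / (exp(K) + exp(-K)), where K = J / kT. The function names are hypothetical.

```python
import math
import random

def sample_ising_chain(n, coupling, rng):
    """Draw one sample from a zero-field 1D Ising chain with free ends.

    coupling is K = J / (kT); rng is a random.Random instance.
    """
    p_same = math.exp(coupling) / (math.exp(coupling) + math.exp(-coupling))
    spins = [rng.choice((-1, 1))]        # first spin is uniform at zero field
    for _ in range(n - 1):
        prev = spins[-1]
        # Keep the previous spin with probability p_same, otherwise flip it.
        spins.append(prev if rng.random() < p_same else -prev)
    return spins

def count_flips(spins):
    """Number of adjacent pairs with different spins."""
    return sum(1 for a, b in zip(spins, spins[1:]) if a != b)

rng = random.Random(0)
weak = count_flips(sample_ising_chain(2000, 0.0, rng))     # no interaction
strong = count_flips(sample_ising_chain(2000, 2.0, rng))   # strong interaction
```

A larger coupling gives fewer flips and therefore longer runs of equal spins, which is the scale dependence on the parameters noted above.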

The approach to modeling correspondences that is described in this section was
outlined in Wells [74] [75]. Subsequently, Breuel [9] described a similar local
interaction model of occlusion in conjunction with a simplified statistical model of recognition
that used boolean features in a classification based scheme.

The  Markov  correspondence  model  is  not  used  in  the  experiments  reported  in  this 
research. 


2.4  Incorporating  Saliency 

Another route to more accurate modeling of correspondences is to exploit bottom-up
saliency processes to suggest which image features are most likely to correspond to
the object. One such process is described by Ullman and Shashua [66].

For concreteness, assume that the saliency process provides a per-feature measure
of saliency, $S_i$. To incorporate this information, we construct $p(\Gamma_i = \perp \mid S_i)$. This may
be conveniently calculated via Bayes' rule as follows:

$$ p(\Gamma_i = \perp \mid S_i) = \frac{p(S_i \mid \Gamma_i = \perp) \, p(\Gamma_i = \perp)}{p(S_i)} . $$

$p(S_i \mid \Gamma_i = \perp)$ and $p(S_i)$ are probability densities that may be estimated from
observed frequencies in training data. As in Section 2.2, we set $p(\Gamma_i = \perp) = B$.
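A feature specific background probability may then be defined as follows:

$$ B_i \equiv p(\Gamma_i = \perp \mid S_i) = \frac{p(S_i \mid \Gamma_i = \perp) \, B}{p(S_i)} . $$

In this case the complete probability function on $\Gamma_i$ will be

$$ p(\Gamma_i) = \begin{cases} B_i & \text{if } \Gamma_i = \perp \\ \frac{1 - B_i}{m} & \text{otherwise} \end{cases} \qquad (2.6) $$

where $m$ is the number of object features.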


This  model  is  not  used  in  the  experiments  described  in  this  research. 
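
For illustration, the Bayes' rule calculation of this section can be written as a small Python sketch. The density values passed in are assumed to come from training data, the remaining probability is assumed to be spread uniformly over the object features, and the function names are hypothetical.

```python
def feature_background_prob(p_s_given_bg, p_s, background_prob):
    """Feature specific background probability via Bayes' rule.

    p_s_given_bg: density of the saliency value given a background feature.
    p_s: marginal density of the saliency value (must be positive).
    background_prob: the prior background probability B.
    """
    if p_s <= 0.0:
        raise ValueError("marginal saliency density must be positive")
    return p_s_given_bg * background_prob / p_s

def feature_prior(is_background, b_i, num_object_features):
    """Probability of one assignment, given the feature specific value b_i."""
    if is_background:
        return b_i
    # Assumed uniform share of the remaining mass per object feature.
    return (1.0 - b_i) / num_object_features

# A salient feature: its saliency value is less typical of the background
# than of features in general, so its background probability drops below B.
b_i = feature_background_prob(0.2, 0.5, 0.9)
```

In this example the prior background probability of 0.9 is reduced for the salient feature, leaving more probability for assignments to object features.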


2.5  Conclusions 

The simplest of the three models described, the independent correspondence model,
has been used to good effect in the recognition experiments described in Chapters 6, 7,
and 10. In some domains additional robustness in recognition might result from using
the Markov correspondence model, or from incorporating saliency information.


Chapter  3 

Modeling  Image  Features 

Probabilistic models of image features are the topic of this chapter. These are
another important component of the statistical theories of object recognition that are
described in Chapters 6 and 7.

The probability density function for the coordinates of image features, conditioned
on correspondences and pose, is defined. The PDF has two important cases,
depending on whether the image feature is assigned to the object, or to the background.
Features matched to the object are modeled with normal densities, while uniform
densities are used for background features. Empirical evidence is provided to support
the use of normal densities for matched features. A form of stationarity is described.

Many recognition systems implicitly use uniform densities (rather than normal
densities) to model matched image features (bounded error models). The empirical
evidence of Section 3.2.1 indicates that the normal model may sometimes be better.
Because of this, use of normal models may provide better performance in recognition.
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3.1  A  Uniform  Model  for  Background  Features 

The image features, $Y_i$, are $v$-dimensional vectors. When assigned to the background,
they are assumed to be uniformly distributed,

\[ p(Y_i \mid \Gamma, \beta) = \frac{1}{W_1 W_2 \cdots W_v} \quad \text{if } \Gamma_i = \perp . \tag{3.1} \]

(The PDF is defined to be zero outside the coordinate space of the image features,
which has extent $W_i$ along dimension $i$.) $\Gamma$ describes the correspondences from image
features to object features, and $\beta$ describes the position and orientation, or pose, of
the object. For example, if the image features are 2D points in a 640 by 480 image,
then $p(Y_i \mid \perp, \beta) = \frac{1}{640 \cdot 480}$ within the image. For $Y_i$, this probability function depends
only on the $i$'th component of $\Gamma$.

Providing a satisfying probability density function for background features is
problematical. Equation 3.1 describes the maximum entropy PDF consistent with the
constraint that the coordinates of image features are always expected to lie within
the coordinate space of the image features. E.T. Jaynes [46] has argued that maximum
entropy distributions are the most honest representation of a state of incomplete
knowledge.


3.2  A  Normal  Model  for  Matched  Features 

Image features that are matched to object features are assumed to be normally
distributed about their predicted position in the image,

\[ p(Y_i \mid \Gamma, \beta) = G_{\psi_{ij}}\left(Y_i - \mathcal{P}(M_j, \beta)\right) \quad \text{if } \Gamma_i = M_j . \tag{3.2} \]

Here $Y_i$, $\Gamma$, and $\beta$ are defined as above.

$G_{\psi_{ij}}$ is the $v$-dimensional Gaussian probability density function with covariance
matrix $\psi_{ij}$,

\[ G_{\psi_{ij}}(x) = (2\pi)^{-v/2} \, |\psi_{ij}|^{-1/2} \exp\left(-\tfrac{1}{2} x^T \psi_{ij}^{-1} x\right) . \]

The covariance matrix $\psi_{ij}$ is discussed more fully in Section 3.3.

When $\Gamma_i = M_j$, the predicted coordinates of image feature $Y_i$ are given by
$\mathcal{P}(M_j, \beta)$, the projection of object feature $j$ into the image with object pose $\beta$.
Projection and pose are discussed in more detail in Chapter 5.

Figure 3-1: Fine Image Features and Fine Model Features
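The two cases of the feature density can be sketched in a few lines of Python. This is a minimal illustration for 2D point features only, not the implementation used in the experiments; the function names and the tuple representation of the covariance matrix are assumptions made for the example.

```python
import math

def background_density(extents):
    """Uniform density 1 / (W_1 * ... * W_v) over the image coordinate space."""
    volume = 1.0
    for w in extents:
        volume *= w
    return 1.0 / volume

def matched_density(y, predicted, cov):
    """2D Gaussian density of image feature y about its predicted position.

    cov is a 2x2 covariance matrix given as ((a, b), (c, d)); the names and
    the data layout are hypothetical, chosen only for this sketch.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx = y[0] - predicted[0]
    dy = y[1] - predicted[1]
    # Quadratic form x^T cov^{-1} x, using the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))
```

For a 640 by 480 image, `background_density((640, 480))` gives the constant density of the example in Section 3.1, while `matched_density` falls off with the distance between a feature and its predicted position.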

3.2.1  Empirical  Evidence  for  the  Normal  Model 

This section describes some empirical evidence from the domain of video image edge
features indicating that normal probability densities are good models of feature
fluctuations, and that they can be better than uniform probability densities. The
evidence is provided in the form of observed and fitted cumulative distributions and
Kolmogorov-Smirnov tests. The model distributions were fitted to the data using the
Maximum Likelihood method.

The data analyzed are the perpendicular and parallel deviations of fine
and coarse edge features derived from video images. The fine and coarse features are
shown in Figures 3-1 and 3-3 respectively.

The model features are from Mean Edge Images, which are described in Section
4.4. The edge operator used in obtaining the image features finds ridges in the magnitude
of the image gradient, as discussed in Section 4.4. The smoothing standard deviation
used in the edge detection was 2.0 and 4.0 pixels, respectively, for the fine and coarse
features. These features were also used in the experiments reported in Section 10.1,
and the correspondences were used there as training data.

Figure 3-3: Coarse Image Features and Coarse Model Features

Figure 3-4: Coarse Feature Correspondences

For the analysis in this section, the feature data consists of the average of the
x and y coordinates of the pixels from edge curve fragments; that is, they are 2D point
features. The features are displayed as circular arc fragments for clarity. The
curves were broken arbitrarily into 10 and 20 pixel fragments for the fine and coarse
features respectively.

Correspondences  from  image  features  to  model  features  were  established  by  a 
neutral  subject  using  a  mouse.  These  correspondences  are  indicated  by  heavy  lines 
in  Figures  3-2  and  3-4.  Perpendicular  and  parallel  deviations  of  the  corresponding 
features  were  calculated  with  respect  to  the  normals  to  edge  curves  at  the  image 
features. 

Figure  3-5  shows  the  cumulative  distributions  of  the  perpendicular  and  parallel 
deviations  of  the  fine  features.  The  cumulative  distributions  of  fitted  normal  densities 
are  plotted  as  heavy  dots  over  the  observed  distributions.  The  distributions  were  fitted 
to  the  data  using  the  Maximum  Likelihood  method  -  the  mean  and  variance  of  the 
normal  density  are  set  to  the  mean  and  variance  of  the  data.  These  figures  show  good 
agreement  between  the  observed  distributions,  and  the  fitted  normal  distributions. 
Similar observed and fitted distributions for the coarse deviations are shown in Figure
3-6, again with good agreement.

The observed cumulative distributions are shown again in Figures 3-7 and 3-8,
this time with the cumulative distributions of fitted uniform densities over-plotted
in heavy dots. As before, the uniform densities were fitted to the data using the
Maximum Likelihood method; in this case the uniform densities are adjusted to just
include the extreme data. These figures show relatively poor agreement between the
observed and fitted distributions, in comparison to normal densities.


Figure 3-7: Observed Cumulative Distributions and Fitted Uniform Distributions
for Fine Features




                                Normal Hypothesis       Uniform Hypothesis
Deviate                 N       D_o      P(D > D_o)     D_o      P(D > D_o)
Fine Perpendicular      118     .0824    .3996          .2244    .000014
Fine Parallel           118     .0771    .4845          .1596    .0049
Coarse Perpendicular    28      .1526    .5317          .2518    .0574
Coarse Parallel         28      .0948    .9628          .1543    .5172

Table 3.1: Kolmogorov-Smirnov Tests


Kolmogorov-Smirnov  Tests 


The Kolmogorov-Smirnov (KS) test [59] is one way of analyzing the agreement between
observed and fitted cumulative distributions, such as the ones in Figures 3-5
to 3-8. The KS test is computed on the magnitude of the largest difference between
the observed and hypothesized (fitted) distributions. This will be referred to as $D$.
The probability distribution on this distance, under the hypothesis that the data were
drawn from the hypothesized distribution, can be calculated. An asymptotic formula
is given by

\[ P(D > D_o) = Q(\sqrt{N} D_o) \]

where

\[ Q(x) = 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 x^2} \]

and $D_o$ is the observed value of $D$.
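The asymptotic formula is easy to evaluate numerically. The following sketch, written under the assumption that a truncated series is adequate for the values of interest, computes the KS distance of a sample against a hypothesized cumulative distribution and the corresponding tail probability. The function names are illustrative, and the series should not be trusted for arguments very close to zero, where it converges slowly.

```python
import math

def ks_q(x, terms=100):
    """Truncated series Q(x) = 2 * sum_{j>=1} (-1)^(j-1) * exp(-2 j^2 x^2)."""
    total = 0.0
    for j in range(1, terms + 1):
        total += (-1) ** (j - 1) * math.exp(-2.0 * j * j * x * x)
    return 2.0 * total

def ks_pvalue(d_obs, n):
    """Asymptotic P(D > d_obs) for a sample of size n."""
    return ks_q(math.sqrt(n) * d_obs)

def normal_cdf(x, mean, std):
    """Cumulative distribution of a normal density, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF of data and a hypothesized cdf."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF steps from k/n to (k+1)/n at x; check both sides.
        d = max(d, abs(f - k / n), abs((k + 1) / n - f))
    return d
```

As a check, `ks_pvalue(0.0824, 118)` and `ks_pvalue(0.0948, 28)` agree with the corresponding Normal Hypothesis entries of Table 3.1 (.3996 and .9628) to about three decimal places.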

The results of KS tests of the consistency of the data with fitted normal and
uniform distributions are shown in Table 3.1. Low values of $P(D > D_o)$ suggest
incompatibility between the data and the hypothesized distribution. In the cases
of fine perpendicular and parallel deviations, and coarse perpendicular deviations,
refutation of the uniform model is strongly indicated. Strong contradictions of the
fitted normal models are not indicated in any of the cases.

3.3 Oriented Stationary Statistics

The covariance matrix $\psi_{ij}$ that appears in the model of matched image features in
Equation 3.2 is allowed to depend on both the image feature and the object feature
involved in the correspondence. Indexing on $i$ allows dependence on the image feature
detection process, while indexing on $j$ allows dependence on the identity of the model
feature. This is useful when some model features are known to be noisier than others.
This flexibility is carried through the formalism of later chapters. Although such
flexibility can be useful, substantial simplification results by assuming that the feature
statistics are stationary in the image, i.e. $\psi_{ij} = \psi$ for all $i$, $j$. This could be
reasonable if the feature fluctuations were isotropic in the image, for example. In its strict
form this assumption may be too limiting, however. This section outlines a compromise
approach, oriented stationary statistics, that was used in the implementations
described in Chapters 6, 7, and 8.

This  method  involves  attaching  a  coordinate  system  to  each  image  feature.  The 
coordinate  system  has  its  origin  at  the  point  location  of  the  feature,  and  is  oriented 
with  respect  to  the  direction  of  the  underlying  curve  at  the  feature  point.  When 
(stationary)  statistics  on  feature  deviations  are  measured,  they  are  taken  relative  to 
these  coordinate  systems. 


3.3.1  Estimating  the  Parameters 

The experiments reported in Sections 6.2, 7.1, and Chapter 10 use the normal model
and oriented stationary statistics for matched image features. After this choice of
model, it is still necessary to supply the specific parameters for the model, namely,
the covariance matrices, $\psi_{ij}$, of the normal densities.

The parameters were estimated from observations on matches done by hand on
sample images from the domain. Because of the stationarity assumption it is possible
to estimate the common covariance, $\psi$, by observing match data on one image. For
this purpose, a match was done with a mouse between features from a Mean Edge Image
(these are described in Section 4.4) and a representative image from the domain.
During this process, the pose of the object was the same in the two images. This
produced a set of corresponding edge features. For the sake of example, the process
will be described for 2D point features (described in Section 5.2). The procedure has
also been used with 2D point-radius features and 2D oriented-range features, which are
described in Sections 5.3 and 5.4 respectively.

Let the observed image features be described by $Y_i$, and the corresponding mean
model features by $\hat{Y}_i$. The observed residuals between the "data" image features, and
the "mean" features are $\Delta_i = \hat{Y}_i - Y_i$.

The features are derived from edge data, and the underlying edge curve has an
orientation angle in the image. These angles are used to define coordinate systems
specific to each image feature $Y_i$. These coordinate systems define rotation matrices
$R_i$ that are used to transform the residuals into the coordinate systems of the features,
in the following way: $\Delta'_i = R_i \Delta_i$.

The stationary covariance matrix of the matched feature fluctuations observed
in the feature coordinate systems is then estimated using the Maximum Likelihood
method, as follows,

\[ \hat{\psi} = \frac{1}{n} \sum_{i} \Delta'_i \Delta'^T_i . \]

Here $T$ denotes the matrix transpose operation. This technique has some bias, but
for the reasonably large sample sizes involved ($n \approx 100$) the effect is minor.
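The estimation procedure can be sketched as follows for 2D point features. This is a minimal illustration rather than the code used in the experiments. It assumes that each rotation takes image coordinates into a frame whose first axis lies along the edge curve, and that the residuals are treated as zero-mean; the function names are hypothetical.

```python
import math

def rotation(angle):
    """2x2 rotation taking image coordinates into a frame aligned with an edge
    whose direction makes `angle` (radians) with the image x axis (assumed convention)."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, s), (-s, c))

def oriented_covariance(residuals, angles):
    """Estimate (1/n) * sum_i d'_i d'_i^T, where d'_i is residual i rotated into its feature frame."""
    n = len(residuals)
    sxx = sxy = syy = 0.0
    for (dx, dy), angle in zip(residuals, angles):
        (r00, r01), (r10, r11) = rotation(angle)
        along = r00 * dx + r01 * dy     # component along the edge curve
        across = r10 * dx + r11 * dy    # component perpendicular to the edge curve
        sxx += along * along
        sxy += along * across
        syy += across * across
    return ((sxx / n, sxy / n), (sxy / n, syy / n))
```

With this convention the first diagonal entry of the result is the variance along the edge curve and the second is the variance perpendicular to it.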

The resulting covariance matrices typically indicate larger variance for deviations
along the edge curve than perpendicular to it, as suggested by the data in Figures
3-5 and 3-6.

3.3.2 Specializing the Covariance

At recognition time, it is necessary to specialize the constant covariance to each image
feature. This is done by rotating it to orient it with respect to the image feature.

A covariance matrix transforms like the following product of residuals:

\[ \Delta'_i \Delta'^T_i . \]

This is transformed back to the image system as follows,

\[ R_i^T \Delta'_i \Delta'^T_i R_i . \]

Thus the constant covariance is specialized to the image features in the following way,

\[ \psi_{ij} = R_i^T \hat{\psi} R_i . \]
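A short sketch of this specialization step for the 2D case is given below. It assumes the same convention as the estimation step, namely that the rotation takes image coordinates into a feature frame whose first axis lies along the edge curve; the function name and the nested-sequence matrix representation are illustrative.

```python
import math

def specialize_covariance(psi, angle):
    """Return R^T psi R, where R rotates image coordinates into the frame of a
    feature whose edge direction makes `angle` (radians) with the image x axis."""
    c, s = math.cos(angle), math.sin(angle)
    r = ((c, s), (-s, c))
    # tmp = psi * R
    tmp = [[sum(psi[i][k] * r[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
    # result = R^T * tmp
    return [[sum(r[k][i] * tmp[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
```

For example, a stationary covariance with variance 4 along the edge and 1 across it, specialized to a feature on a vertical edge (`angle = math.pi / 2`), yields an image-frame covariance with variance 1 in x and 4 in y, up to rounding.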


Chapter  4 

Modeling  Objects 

What  is  needed  from  object  models?  For  recognition,  the  main  issue  lies  in  predicting 
the  image  features  that  will  appear  in  an  image  of  the  object.  Should  the  object  model 
be  a  monolithic  3D  data  structure?  After  all,  the  object  itself  is  3D.  In  this  chapter, 
some  pros  and  cons  of  monolithic  3D  models  are  outlined.  An  alternative  approach, 
interpolation  of  views,  is  proposed.  The  related  problem  of  obtaining  the  object 
model  data  is  discussed,  and  it  is  proposed  that  the  object  model  data  be  obtained 
by  taking  pictures  of  the  object.  An  automatic  method  for  this  purpose  is  described. 
Additionally,  a  means  of  edge  detection  that  captures  the  average  edges  of  an  object 
is  described. 

4.1  Monolithic  3D  Object  Models 

One motivation for using 3D object models in recognition systems is the observation
that computer graphics techniques can be used to synthesize convincing images from
3D models in any pose desired.

For some objects, having a single 3D model seems a natural choice for a recognition
system. If the object is polygonal, and is represented by a list of 3D line segments and
vertices, then predicting the features that will appear in a given high resolution view
is a simple matter. All that is needed is to apply a pose dependent transformation to
each feature, and to perform a visibility test.

For other objects, such as smoothly curved objects, the situation is different.
Predicting features becomes more elaborate. In video imagery, occluding edges (or limbs)
are often important features. Calculating the limb of a smooth 3D surface is usually
complicated. Ponce and Kriegman [58] describe an approach for objects modeled
by parametric surface patches. Algebraic elimination theory is used to relate image
limbs to the model surfaces that generated them. Brooks' vision system, Acronym
[10], also recognized curved objects from image limbs. It used generalized cylinders
to model objects. A drawback of this approach is that it is awkward to realistically
model typical objects, like telephones or automobiles, with generalized cylinders.

Predicting  reduced  resolution  image  features  is  another  difficulty  with  monolithic 
3D  models.  This  is  a  drawback  because  doing  recognition  with  reduced  resolution 
features  is  an  attractive  strategy;  with  fewer  features  less  search  will  be  needed.  One 
solution  would  be  to  devise  a  way  of  smoothing  3D  object  models  such  that  simple 
projection  operations  would  accurately  predict  reduced  resolution  edge  features.  No 
such  method  is  known  to  the  author. 

Detecting reduced resolution image features is straightforward. Good edge features
of this sort may be obtained by smoothing the grayscale image before using an
edge operator. This method is commonly used with the Canny edge operator [13]
and with the Marr-Hildreth operator [53].

An alternative approach is to do projections of the object model at full resolution,
and then to do some kind of smoothing of the image. It isn't clear what sort of
smoothing would be needed. One possibility is to do photometrically realistic projections
(for example by ray tracing), perform smoothing in the image, and
then use the same feature detection scheme as is used on the images presented for
recognition. This method is likely to be too expensive for practical recognition systems
that need to perform large amounts of prediction. Perhaps better ways of doing this
will be found.

Self  occlusion  is  an  additional  complexity  of  the  monolithic  3D  model  approach. 
In  computer  graphics  there  are  several  ways  of  dealing  with  this  issue,  among  them 
hidden  line  and  z-buffer  methods.  These  methods  are  fairly  expensive,  at  least  in 
comparison  to  sparse  point  projections. 

In  summary,  monolithic  3D  object  models  address  some  of  the  requirements  for 
predicting  images  for  recognition,  but  the  computational  cost  may  be  high. 

4.2  Interpolation  of  Views 

One approach to avoiding the difficulties discussed in the previous section is to use an
image-based approach to object modeling. Ullman and Basri [71] have discussed such
approaches. There is some biological evidence that animal vision systems have recognition
subsystems that are attuned to specific views of faces [25]. This may provide
some assurance that image-based approaches to recognition aren't unreasonable.

An  important  issue  with  image-based  object  modeling  concerns  how  to  predict 
image  features  in  a  way  that  covers  the  space  of  poses  that  the  object  may  assume. 

Bodies  undergoing  rigid  motion  in  space  have  six  degrees  of  freedom,  three  in 
translation,  and  three  in  rotation.  This  six  parameter  pose  space  may  be  split  into  two 
parts  -  the  first  part  being  translation  and  in  image-plane  rotations  (four  parameters) 
-  the  second  part  being  out  of  image-plane  rotations  (two  parameters:  the  “view 
sphere”). 

Synthesizing views of an object that span the first part of pose space can often
be done using simple and efficient linear methods of translation, rotation, and scale
in the plane. This approach can be precise under orthographic projection with scaling,
and accurate enough in some domains with perspective projection. Perspective
projection is often approximated in recognition systems by 3D rotation combined
with orthographic projection and scaling. This has been called the weak perspective
approximation [70].

The second part of pose space, out of plane rotation, is more complicated. The
approach advocated in this research involves tesselating the view sphere around the
object, and storing a view of the object for each vertex of the tesselation. Arbitrary
views will then entail, at most, small out of plane rotations from stored views. These
views may be synthesized using interpolation. The Linear Combination of Views
method of Ullman and Basri [71] works well for interpolating between nearby views
(and more distant ones, as well).

Conceptually,  the  interpolation  of  views  method  caches  pre-computed  predictions 
of  images,  saving  the  expense  of  repeatedly  computing  them  during  recognition.  If 
the  tesselation  is  dense  enough,  difficulties  owing  to  large  changes  in  aspect  may  be 
avoided. 

Breuel  [9]  advocates  a  view-based  approach  to  modeling,  without  interpolation. 


4.3  Object  Models  from  Observation 

How  can  object  model  features  be  acquired  for  use  in  the  interpolation  of  views 
framework?  If  a  detailed  CAD  model  of  the  object  is  available,  then  views  might  be 
synthesized  using  graphical  rendering  programs  (this  approach  was  used  in  the  (single 
view)  laser  radar  experiment  described  in  Section  7.3). 

Another  method  is  to  use  the  object  itself  as  its  own  model,  and  to  acquire  views 
by  taking  pictures  of  the  object.  This  process  can  make  use  of  the  feature  extraction 
method  that  is  used  on  images  at  recognition  time.  An  advantage  of  this  scheme  is 
that  an  accurate  CAD  style  model  isn't  needed.  Using  the  run-time  feature  extraction 
mechanism  of  the  recognition  system  automatically  selects  the  features  that  will  be 
salient  at  recognition  time,  which  is  otherwise  a  potentially  difficult  problem. 

One difficulty with the models from observation approach is that image features
tend to be somewhat unstable. For example, the presence and location of edge features
are influenced by illumination conditions, as illustrated in the following figures. Figure
4-1 shows a series of nine grayscale images where the only variation is in lighting. A
corresponding set of edge images is shown in Figure 4-2. The edge operator used in preparing
the images is described in Section 4.4. The standard deviation of the smoothing
operator was 2 pixels.


4.4  Mean  Edge  Images 

It was pointed out above that the instability of edge features is a potential difficulty
of acquiring object model features from observation. The Mean Edge Image method
addresses this problem by making edge maps that are averaged over variations due to
illumination changes.

Brightness edges may be characterized as the ridges of a measure of brightness
variation. This is consistent with the common notion that edges are the 1D loci of
maxima of changes in brightness. The edge operator used in Figure 4-2 is an example
of this style of edge detector. It is a ridge operator applied to the squared discrete
gradient of smoothed images. Here, the squared discrete gradient is the measure of
brightness variation. This style of edge detection was described by Mercer [57]. The
mathematical definition of the ridge predicate is that the gradient is perpendicular to
the direction having the most negative second directional derivative. Another similar
definition of edges was proposed by Haralick [37]. For a general survey of edge detection
methods, see Robot Vision, by Horn [39].

The preceding characterization of image edges generalizes naturally to mean edges.
Mean edges are defined to be ridges in the average measure of brightness fluctuation.
In this work, average brightness fluctuation over a set of pictures is obtained by
averaging the squared discrete gradient of the (smoothed) images.
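The averaging step can be sketched as below. This is a simplified illustration, not the thesis implementation: it assumes the images have already been smoothed, represents each image as a list of rows of numbers, uses forward differences for the discrete gradient (an assumption), and omits the ridge detection and thresholding that follow.

```python
def squared_gradient(image):
    """Squared discrete gradient magnitude of a 2D list, using forward differences."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * (cols - 1) for _ in range(rows - 1)]
    for r in range(rows - 1):
        for c in range(cols - 1):
            gx = image[r][c + 1] - image[r][c]
            gy = image[r + 1][c] - image[r][c]
            out[r][c] = gx * gx + gy * gy
    return out

def mean_squared_gradient(images):
    """Average the squared gradient over images of one pose under different lighting."""
    maps = [squared_gradient(img) for img in images]
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / n for c in range(cols)]
            for r in range(rows)]
```

Mean edges would then be found as ridges of the array returned by `mean_squared_gradient`, in the same way that ordinary edges are found as ridges of a single squared gradient image.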

Figure 4-3 shows the averaged squared gradient of smoothed versions of the images
that appear in Figure 4-1. Recall that only the lighting changed between these images.

Figure 4-2: Edge Images

Figure 4-3: Averaged Squared Gradient of Smoothed Images

Figure 4-4 shows the ridges from the image of Figure 4-3. Hysteresis thresholding
based on the magnitude of the averaged squared gradient has been used to suppress
weak edges. Such hysteresis thresholding is used with the Canny edge operator. Note
that this edge image is relatively immune to specular highlights, in comparison to the
individual edge images of Figure 4-2.


4.5  Automatic  3D  Object  Model  Acquisition 

This section outlines a method for automatic 3D object model acquisition that combines
interpolation of views and Mean Edge Images. The method involves automatically
acquiring (many) pictures of the object under various combinations of pose and
illumination. A preliminary implementation of the method was used to acquire object
model features for the 3D recognition experiment discussed in Section 10.4.

Figure 4-5: A Pentakis Dodecahedron

The object, a plastic car model, was mounted on the tool flange of a PUMA 560
robot. A video camera connected to a Sun Microsystems VFC video digitizer was
mounted near the robot.

For the purpose of Interpolation of Views object model construction, the view
sphere around the object was tesselated into 32 view points, the vertices of a pentakis
dodecahedron (one is illustrated in Figure 4-5). At each view point a "canonical pose"
for the object was constructed that oriented the view point towards the camera, while
keeping the center of the object in a fixed position.
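The 32 view directions can be generated directly. The sketch below is an illustration under stated assumptions rather than the procedure used in the experiments: it takes the 20 vertices of a regular dodecahedron together with its 12 face-center directions (the vertices of the dual icosahedron), and projects all of them onto the unit sphere. The coordinates are the standard golden-ratio ones; the function names are hypothetical.

```python
import itertools
import math

PHI = (1.0 + math.sqrt(5.0)) / 2.0

def _signed(triple):
    """All sign variations of a coordinate triple, without duplicating zeros."""
    choices = [(v,) if v == 0 else (v, -v) for v in triple]
    return list(itertools.product(*choices))

def _unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

def view_directions():
    """32 unit vectors: 20 dodecahedron vertices plus its 12 face-center directions."""
    a, b = PHI, 1.0 / PHI
    points = _signed((1, 1, 1))
    for triple in ((0, a, b), (b, 0, a), (a, b, 0)):   # remaining dodecahedron vertices
        points += _signed(triple)
    for triple in ((0, 1, a), (1, a, 0), (a, 0, 1)):   # face centers (icosahedron vertices)
        points += _signed(triple)
    return [_unit(p) for p in points]

def min_separation_degrees(dirs):
    """Smallest angle, in degrees, between any two distinct directions."""
    best = 180.0
    for u, v in itertools.combinations(dirs, 2):
        dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(u, v))))
        best = min(best, math.degrees(math.acos(dot)))
    return best
```

With these directions, the smallest separation between neighboring view points comes out between 37 and 38 degrees, which gives a sense of how coarse a 32-view tesselation is.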

Nine different configurations of lighting were arranged for the purpose of
constructing Mean Edge Images. The lighting configurations were made by moving a
spotlight to nine different positions that illuminated the object. The lamp positions
roughly covered the view hemisphere centered on the camera.

The object was moved to the canonical poses corresponding to the 21 vertices in
the upper part (roughly 2/3) of the object's view sphere. At each of these poses,
pictures were taken with each of the nine lamp positions.

Mean Edge Images at various scales of smoothing were constructed for each of
the canonical poses. Object model features for recognition experiments described in
Chapter 8 were derived from these Mean Edge Images. Twenty of the images from
one such set of Mean Edge Images are displayed in Figures 4-6 and 4-7.

Two of these Mean Edge Images were used in an experiment in 3D recognition
using a two-view Linear Combination of Views method. This method requires
correspondences among features at differing views. These correspondences were established
by hand, using a mouse.

It  is  likely  that  such  feature  correspondence  could  be  derived  from  the  results 
of  a  motion  program.  Shashua’s  motion  program  [65],  which  combines  geometry 
and  optical  flow,  was  tested  on  images  from  the  experimental  setup  and  was  able 
to  establish  good  correspondences  at  the  pixel  level,  for  views  separated  by  4.75 
degrees.  This  range  could  be  increased  by  a  sequential  bootstrapping  process.  If 
correspondences  can  be  automatically  determined,  then  the  entire  process  of  building 
view-based  models  for  3D  objects  can  be  made  fully  automatic. 

After performing the experiments reported in Chapter 10, it became apparent that
the views were separated by too large an angle (about 38 degrees) for establishing
a good amount of feature correspondence between some views. This problem may be
relieved by using more views. Using more views also makes automatic determination
of correspondences easier. If the process of model construction is fully automatic,
having a relatively large number of views is potentially workable.

The work of Taylor and Reeves [69] provides some evidence for the feasibility of
multiple-view-based recognition. They describe a classification-based vision system
that uses a library of views from a 252 vertex icosahedron-based tesselation of the
view sphere. Their views were separated by 6.0 to 8.7 degrees. They report good
classification of aircraft silhouettes using this approach.


Chapter 5

Modeling  Projection 

This  chapter  is  concerned  with  the  representations  of  image  and  object  features,  and 
with  the  projection  of  object  features  into  the  image,  given  the  pose  of  the  object. 
Four  different  formulations  are  described,  three  of  which  are  used  in  experiments 
reported  in  other  chapters. 

The first three models described in this chapter are essentially 2D; the
transformations comprise translation, rotation, and scaling in the plane. Such methods
may be used for single views of 3D objects via the weak perspective approximation,
as described in [70]. In this scheme, perspective projection is approximated by
orthographic projection with scaling. Within this approximation, these methods can
handle four of the six parameters of rigid body motion, that is, everything but out of plane
rotations.

The method described in Section 5.5 is based on Linear Combination of Views,
a view-based 3D method that was developed by Ullman and Basri [71].

5.1  Linear  Projection  Models 

Pose determination is often a component of model-based object recognition systems,
including the systems described in this thesis. Pose determination is frequently framed
as an optimization problem. The pose determination problem may be significantly
simplified if the feature projection model is linear in the pose vector. The systems
described in this thesis use projection models having this property, which enables solving
the embedded optimization problem using least squares. Least squares is advantageous
because unique solutions may be obtained easily in closed form. This is a
significant advantage, since the embedded optimization problem is solved many times
during the course of a search for an object in a scene.

All of the formulations of projection described below are linear in the parameters
of the transformation. Because of this they may be written in the following form:

\[ \eta_i = \mathcal{P}(M_i, \beta) = M_i \beta . \tag{5.1} \]

The pose of the object is represented by $\beta$, a column vector, and the object model
feature by $M_i$, a matrix. $\eta_i$, the projection of the model feature into the image by
pose $\beta$, is a column vector.

Although this particular form may seem odd, it is a natural one if the focus is on
solving for the pose and the object model features are constants.
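To illustrate why a projection model that is linear in the pose admits a closed-form least squares solution, the following sketch fits a 2D similarity transform (rotation, scale, and translation in the plane) to point correspondences. It is an unweighted illustration under stated assumptions, not the weighted estimation used elsewhere in this thesis. Points are treated as complex numbers, and the names `fit_similarity`, `mu`, `nu`, `t_x`, and `t_y` are chosen for the example.

```python
def fit_similarity(model_points, image_points):
    """Closed-form least squares fit of image = z * model + t over complex z and t.

    Returns (mu, nu, t_x, t_y) with z = mu + i*nu and t = t_x + i*t_y.
    """
    n = len(model_points)
    p = [complex(x, y) for x, y in model_points]
    q = [complex(x, y) for x, y in image_points]
    p_mean = sum(p) / n
    q_mean = sum(q) / n
    # Normal equations for z after eliminating t by centering both point sets.
    num = sum((pi - p_mean).conjugate() * (qi - q_mean) for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    z = num / den              # rotation and scale
    t = q_mean - z * p_mean    # translation
    return (z.real, z.imag, t.real, t.imag)
```

Because the objective is quadratic in the four pose parameters, no iteration is needed: one pass over the correspondences yields the unique minimizer, provided the model points are not all coincident.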

5.2  2D  Point  Feature  Model 

The first, and simplest, method to be described was used by Faugeras and Ayache in
their vision system HYPER [1]. It is defined as follows: $\eta_i = M_i \beta$, where

\[
\eta_i = \begin{bmatrix} p'_{ix} \\ p'_{iy} \end{bmatrix} , \quad
M_i = \begin{bmatrix} p_{ix} & -p_{iy} & 1 & 0 \\ p_{iy} & p_{ix} & 0 & 1 \end{bmatrix} , \quad
\beta = \begin{bmatrix} \mu \\ \nu \\ t_x \\ t_y \end{bmatrix} .
\]

The coordinates of object model point $i$ are $p_{ix}$ and $p_{iy}$. The coordinates of
model point $i$, projected into the image by pose $\beta$, are $p'_{ix}$ and $p'_{iy}$. This
transformation is equivalent to rotation by $\theta$, scaling by $s$, and translation by $T$, where

\[
T = \begin{bmatrix} t_x \\ t_y \end{bmatrix} , \quad
s = \sqrt{\mu^2 + \nu^2} , \quad
\theta = \arctan \frac{\nu}{\mu} .
\]

This representation treats the two classes of features unsymmetrically, which seems
odd given their essential equivalence; however, the trick facilitates the linear
formulation of projection given in Equation 5.1.

In this model, rotation and scale are effected by analogy to the multiplication of
complex numbers, which induces transformations of rotation and scale in the complex
plane. This analogy may be made complete by noting that the algebra of complex
numbers $a + ib$ is isomorphic with that of matrices of the form

\[ \begin{bmatrix} a & b \\ -b & a \end{bmatrix} . \]
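The following is a minimal sketch of how a linear projection model of this kind reduces pose determination to least squares. It assumes the layout $M_i = [p_{ix}, -p_{iy}, 1, 0;\; p_{iy}, p_{ix}, 0, 1]$ and $\beta = (\mu, \nu, t_x, t_y)$, unit feature covariance, and known correspondences; the function names are hypothetical and are not taken from the systems described in this thesis.

```python
import numpy as np

def point_feature_matrix(p):
    """M_i for model point p = (p_x, p_y); layout is an assumption of this sketch."""
    px, py = p
    return np.array([[px, -py, 1.0, 0.0],
                     [py,  px, 0.0, 1.0]])

def pose_from_similarity(s, theta, t):
    """Pose vector (mu, nu, t_x, t_y) for scale s, rotation theta, translation t."""
    return np.array([s * np.cos(theta), s * np.sin(theta), t[0], t[1]])

def fit_pose(model_points, image_points):
    """Least-squares pose from matched points (unit feature covariance assumed)."""
    A = np.vstack([point_feature_matrix(p) for p in model_points])
    y = np.asarray(image_points, dtype=float).reshape(-1)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# Synthetic check: project a model with a known pose, then recover the pose.
model = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
true_beta = pose_from_similarity(1.5, np.pi / 6, (3.0, -1.0))
image = [point_feature_matrix(p) @ true_beta for p in model]

beta = fit_pose(model, image)
s = np.hypot(beta[0], beta[1])          # scale recovered from (mu, nu)
theta = np.arctan2(beta[1], beta[0])    # rotation recovered from (mu, nu)
```

Because the stacked system is linear in the pose, no iteration or initial guess is needed; `arctan2` is used in place of a plain arctangent so that the recovered angle is correct in every quadrant.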


5.3 2D Point-Radius Feature Model

This  section  describes  an  extension  of  the  previous  feature  model  that  incorporates 
information  about  the  normal  and  curvature  at  a  point  on  a  curve  (in  addition  to  the 
coordinate  information). 

There are advantages in using richer features in recognition: they provide more
constraints, and can lead to space and time efficiencies. These potential advantages
must be weighed against the practicality of detecting the richer features. For example,
there is incentive to construct features incorporating higher derivative information at
a point on a curve; however, measuring higher derivatives of curves derived from video
imagery is probably impractical, because each derivative magnifies the noise present
in the data.

Figure 5-1: Edge Curve, Osculating Circle, and Radius Vector

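The feature described here is a compromise between richness and detectability. It
is defined as follows: $\eta_i = M_i \beta$, where

\[
\eta_i = \begin{bmatrix} p'_{ix} \\ p'_{iy} \\ c'_{ix} \\ c'_{iy} \end{bmatrix} , \quad
M_i = \begin{bmatrix}
p_{ix} & -p_{iy} & 1 & 0 \\
p_{iy} & p_{ix} & 0 & 1 \\
c_{ix} & -c_{iy} & 0 & 0 \\
c_{iy} & c_{ix} & 0 & 0
\end{bmatrix} , \quad
\beta = \begin{bmatrix} \mu \\ \nu \\ t_x \\ t_y \end{bmatrix} .
\]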
The point coordinates and $\beta$ are as above. $c_{ix}$ and $c_{iy}$ represent the radius vector
of the curve's osculating circle that touches the point on the curve, as illustrated
in Figure 5-1. This vector is normal to the curve. Its length is the inverse of the
curvature at the point. The counterparts in the image are given by $c'_{ix}$ and $c'_{iy}$. With
this model, the radius vector $c$ rotates and scales as do the coordinates $p$, but it does
not translate. Thus, the aggregate feature translates, rotates and scales correctly.
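As an illustration of the claim that the radius vector rotates and scales but does not translate, the sketch below projects one point-radius feature. The row layout of the matrix (two rows for the point, two rows for the radius vector with zeros in the translation columns) is an assumption made for this example, and `point_radius_matrix` is a hypothetical name.

```python
import numpy as np

def point_radius_matrix(p, c):
    """M_i for point p and radius vector c; the layout is an assumption of this sketch."""
    px, py = p
    cx, cy = c
    return np.array([[px, -py, 1.0, 0.0],   # point: rotated, scaled, translated
                     [py,  px, 0.0, 1.0],
                     [cx, -cy, 0.0, 0.0],   # radius vector: rotated and scaled only
                     [cy,  cx, 0.0, 0.0]])

# Pose (mu, nu, t_x, t_y): rotate by 90 degrees, scale by 2, translate by (5, 7).
beta = np.array([0.0, 2.0, 5.0, 7.0])
eta = point_radius_matrix((1.0, 0.0), (0.0, 0.5)) @ beta
# eta is [5, 9, -1, 0]: the point (1, 0) maps to (5, 9), and the radius
# vector (0, 0.5) maps to (-1, 0) with no translation applied.
```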

This feature model is used in the experiments described in Sections 6.2, 7.4, and
10.1. When the underlying curvature goes to zero, the length of the radius vector
diverges, and the direction becomes unstable. This has been accommodated in the
experiments by truncating $c$. Although this violates the "transforms correctly"
criterion, the model still works well.

5.4 2D Oriented-Range Feature Model

This feature projection model is very similar to the one described previously. It was
designed for use in range imagery instead of video imagery. Like the previous feature,
it is fitted to fragments of image edge curves. In this case, the edges label
discontinuities in range. It is defined just as above in Section 5.3, but the interpretation
of $c$ is different. The point coordinates and $\beta$ are as above. As above, $c_{ix}$ and $c_{iy}$
form a vector whose direction is perpendicular to the (range discontinuity) curve
fragment. The difference is that rather than encoding the inverse of the curvature, the
length of the vector encodes instead the inverse of the range at the discontinuity. The
counterparts in the image are given by $c'_{ix}$ and $c'_{iy}$. The aggregate feature translates,
rotates and scales correctly when used with imaging models where the object features
scale according to the inverse of the distance to the object. This holds under
perspective projection with attached range labels when the object is small compared to
the distance to the object.

This model was used in the experiments described in Section 7.3.

5.5  Linear  Combination  of  Views 

The  technique  used  in  the  above  methods  for  synthesizing  rotation  and  scale  amounts 
to  making  linear  combinations  of  the  object  model  with  a  copy  of  it  that  has  been 
rotated  90  degrees  in  the  plane. 

In their paper, "Recognition by Linear Combination of Models" [71], Ullman and
Basri describe a scheme for synthesizing views under 3D orthography with rotation
and scale that has a linear parameterization. They show that the space of images of
an object is a subspace of a linear space that is spanned by the components of a few
images of an object. They discuss variants of their formulation that are based on two
views, and on three and more views. Recovering conventional pose parameters from
the linear combination coefficients is described in [60].

The following is a brief explanation of the two-view method. The reader is referred
to [71] for a fuller description. Point projection from 3D to 2D under orthography,
rotation, and scale is a linear transformation. If two (2D) views are available, along with
the transformations that produced them (as in stereo vision), then there is enough
data to invert the transformations and solve for the 3D coordinates (three equations
are needed, four are available). The resulting expression for the 3D coordinates will
be a linear equation in the components of the two views. New 2D views may then
be synthesized from the 3D coordinates by yet another linear transformation.
Compounding these linear operations yields an expression for new 2D views that is linear
in the components of the original two views. There is a quadratic constraint on the
3D to 2D transformations, due to the constraints on rotation matrices. The usual
Linear Combination of Views approach makes use of the above linearity property while
synthesizing new views with general linear transformations (without the constraints).
This practice leads to two extra parameters that control stretching transformations
in the synthesized image. It also reduces the need to deal with camera calibrations:
the pixel aspect ratio may be accommodated in the stretching transformations.

The following projection model uses a two-view variant of the Linear Combination
of Views method to synthesize views with limited 3D rotation and scale. Additionally,
translation has been added in a straightforward way. $\eta_i = M_i \beta$, where

\[
\eta_i = \begin{bmatrix} \eta_{ix} \\ \eta_{iy} \end{bmatrix} , \quad
M_i = \begin{bmatrix}
p_{ix} & 0 & q_{ix} & 0 & p_{iy} & 0 & 1 & 0 \\
0 & p_{ix} & 0 & q_{ix} & 0 & p_{iy} & 0 & 1
\end{bmatrix} , \quad
\beta = \begin{bmatrix} \beta_0 & \beta_1 & \cdots & \beta_7 \end{bmatrix}^T .
\]

The coordinates of the $i$'th point in one view are $p_{ix}$ and $p_{iy}$; in the other view
they are $q_{ix}$ and $q_{iy}$.

When this projection model is used, $\beta$ does not in general describe a rigid
transformation, but it is nevertheless called the pose vector for notational consistency.
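The two-view argument above can be checked numerically. The sketch below is an illustration under stated assumptions, not the implementation used in the experiments: it assumes an eight-parameter pose in which each synthesized coordinate is a linear combination of $p_{ix}$, $p_{iy}$, $q_{ix}$ and a constant, and the helper names (`rotation`, `view`, `lcv_matrix`) are hypothetical.

```python
import numpy as np

def rotation(ax, ay, az):
    """3D rotation composed from rotations about the x, y and z axes."""
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return rz @ ry @ rx

def view(points3d, rot, scale=1.0, shift=(0.0, 0.0)):
    """Scaled orthographic view: rotate, drop z, scale, translate."""
    return scale * (points3d @ rot.T)[:, :2] + np.asarray(shift)

def lcv_matrix(p, q):
    """M_i built from two views of point i (assumed column layout)."""
    px, py = p
    qx = q[0]
    return np.array([[px, 0, qx, 0, py, 0, 1, 0],
                     [0, px, 0, qx, 0, py, 0, 1]], dtype=float)

rng = np.random.default_rng(0)
pts = rng.normal(size=(10, 3))                       # synthetic 3D object points
p_view = view(pts, rotation(0.0, 0.0, 0.0))          # first stored view
q_view = view(pts, rotation(0.0, 0.4, 0.0))          # second stored view
new_view = view(pts, rotation(0.2, -0.3, 0.5), scale=1.3, shift=(4.0, -2.0))

# Solve for the eight coefficients that synthesize the new view.
A = np.vstack([lcv_matrix(p, q) for p, q in zip(p_view, q_view)])
beta, *_ = np.linalg.lstsq(A, new_view.reshape(-1), rcond=None)
residual = np.abs(A @ beta - new_view.reshape(-1)).max()
```

In this noise-free setting the residual is at the level of rounding error, which reflects the statement that new views lie in the linear span of the components of the two stored views.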

This  method  is  used  in  the  experiment  described  in  Section  10.4. 
Chapter 6

MAP Model Matching

MAP Model Matching¹ (MMM) is the first of two statistical formulations of object
recognition to be discussed in this thesis. It builds on the models of features and
correspondences, objects, and projection that are described in the previous chapters.
MMM evaluates joint hypotheses of match and pose in terms of their posterior
probability, given an image. MMM is the starting point for the second formulation of
object recognition, Posterior Marginal Pose Estimation (PMPE), which is described
in Chapter 7.

The  MMM  objective  function  is  amenable  to  search  in  correspondence  space, 
the  space  of  all  possible  assignments  from  image  features  to  model  and  background 
features.  This  style  of  search  has  been  used  in  many  recognition  systems,  and  it  is 
used  here  in  a  recognition  experiment  involving  low  resolution  edge  features. 

It is shown that under certain conditions, searching in pose space for maxima of
the MMM objective function is equivalent to robust methods of chamfer matching
[47].


¹ Early versions of this work appeared in [74] and [75].
6.1 Objective Function for Pose and Correspondences

In this section an objective function for evaluating joint hypotheses of match and
pose using the MAP criterion will be derived.

Briefly, probability densities of image features, conditioned on the parameters of
match and pose ("the parameters"), are combined with prior probabilities on the
parameters using Bayes' rule. The result is a posterior probability density on the
parameters, given an observed image. An estimate of the parameters is then formulated
by choosing them so as to maximize their a-posteriori probability. (Hence the term
MAP. See Beck and Arnold's textbook [4] for a discussion of MAP estimation.) MAP
estimators are especially practical when used with normal probability densities.

This research focuses on feature based recognition. The probabilistic models of
image features described in Chapter 3 are used. Initially, image features are assumed
to be mutually independent (this is relaxed in Section 6.1.1). Additionally, matched
image features are assumed to be normally distributed about their predicted positions
in the image, and unmatched (background) features are assumed to be uniformly
distributed in the image. These densities are combined with a prior model of the
parameters. When a linear projection model is used, a simple objective function for
match and pose results.

As described in Chapter 2, the image that is to be analyzed is represented by a
set of $v$-dimensional column vectors,

\[ Y = \{ Y_1, Y_2, \ldots, Y_n \} , \quad Y_i \in R^v . \]

The object model is denoted by $M$,

\[ M = \{ M_1, M_2, \ldots, M_m \} . \]

When linear projection models are used, as discussed in Chapter 5, the object features
will be represented by real matrices: $M_j \in R^{v \times z}$ ($z$ is defined below).

The parameters to be estimated in matching are the correspondences between
image and object features, and the pose of the object in the image. As discussed in
Section 2.1, the state of match, or correspondences, is described by the variable $\Gamma$:

\[ \Gamma = \{ \Gamma_1, \Gamma_2, \ldots, \Gamma_n \} , \quad \Gamma_i \in M \cup \{ \perp \} . \]

Here $\Gamma_i = M_j$ means that image feature $i$ corresponds to object model feature $j$, and
$\Gamma_i = \perp$ means that image feature $i$ is due to the background.

The pose of the object is a real vector: $\beta \in R^z$. A projection function, $\mathcal{P}()$, maps
object model features into the $v$-dimensional image coordinate space according to the
pose,

\[ \mathcal{P}(M_j, \beta) \in R^v . \]

The probabilistic models of image features described in Chapter 3 may be written
as follows:

\[
p(Y_i \mid \Gamma, \beta) =
\begin{cases}
\dfrac{1}{W_1 W_2 \cdots W_v} & \text{if } \Gamma_i = \perp \\[1ex]
G_{\psi_{ij}}(Y_i - \mathcal{P}(M_j, \beta)) & \text{if } \Gamma_i = M_j
\end{cases}
\qquad (6.1)
\]

where

\[ G_{\psi_{ij}}(x) = (2\pi)^{-\frac{v}{2}} \, |\psi_{ij}|^{-\frac{1}{2}} \exp\left( -\tfrac{1}{2} x^T \psi_{ij}^{-1} x \right) . \]

Here $\psi_{ij}$ is the covariance matrix associated with image feature $i$ and object model
feature $j$. Thus image features arising from the background are uniformly distributed
over the image feature coordinate space (the extent of the image feature coordinate
space along dimension $i$ is given by $W_i$), and matched image features are normally
distributed about their predicted locations in the image. In some applications $\psi$ could
be independent of $i$ and $j$ (an assumption that the feature statistics are stationary
in the image), or $\psi$ may depend only on $i$, the image feature index. The latter is the
case when the oriented stationary statistics model is used (see Section 3.3).
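The two feature densities can be sketched in code as follows. This is an illustrative fragment, assuming a uniform background density over a box with extents $W_1, \ldots, W_v$ and a normal density with covariance $\psi$ for matched features; the function names are hypothetical, and logarithms are returned because the objective function derived below is logarithmic.

```python
import numpy as np

def log_background_density(extents):
    """ln of the uniform density 1 / (W_1 W_2 ... W_v)."""
    return -np.sum(np.log(extents))

def log_matched_density(y, predicted, psi):
    """ln G_psi(y - predicted) for a v-dimensional normal density."""
    x = np.asarray(y, dtype=float) - np.asarray(predicted, dtype=float)
    v = x.size
    _, logdet = np.linalg.slogdet(psi)        # ln |psi|, computed stably
    quad = x @ np.linalg.solve(psi, x)        # x^T psi^{-1} x
    return -0.5 * (v * np.log(2 * np.pi) + logdet + quad)
```

For example, `log_background_density([640.0, 480.0])` gives the log density of a background feature anywhere in a 640 by 480 coordinate space.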
Assuming independent features, the joint probability density on image feature
coordinates may be written as follows:

\[
p(Y \mid \Gamma, \beta) = \prod_i p(Y_i \mid \Gamma, \beta)
= \prod_{i : \Gamma_i = \perp} \frac{1}{W_1 W_2 \cdots W_v}
\prod_{ij : \Gamma_i = M_j} G_{\psi_{ij}}(Y_i - \mathcal{P}(M_j, \beta)) .
\qquad (6.2)
\]

This assumption often holds when sensor noise dominates in feature fluctuations.


The next step in the derivation is the construction of a joint prior on
correspondences and pose. In Chapter 2, probabilistic models of feature correspondences were
discussed. The independent correspondence model is used here for simplicity. Use of
the Markov correspondence model is discussed in the following section. The
probability that image feature $i$ belongs to the background is $B_i$, while the remaining
probability is uniformly distributed for correspondences to the $m$ object model features.
In some situations, $B_i$ may be a constant, independent of $i$. Recalling Equations 2.1
and 2.6,

\[
P(\Gamma) = \prod_i P(\Gamma_i) , \qquad
P(\Gamma_i) =
\begin{cases}
B_i & \text{if } \Gamma_i = \perp \\[0.5ex]
\dfrac{1 - B_i}{m} & \text{otherwise} .
\end{cases}
\qquad (6.3)
\]


Prior information on the pose is assumed to be supplied as a normal density,

\[ p(\beta) = G_{\psi_\beta}(\beta - \beta_0) \]

where

\[ G_{\psi_\beta}(x) = (2\pi)^{-\frac{z}{2}} \, |\psi_\beta|^{-\frac{1}{2}} \exp\left( -\tfrac{1}{2} x^T \psi_\beta^{-1} x \right) . \]

Here $\psi_\beta$ is the covariance matrix of the pose prior and $z$ is the dimensionality of
the pose vector, $\beta$. With the combination of normal pose priors and linear projection
models the system is closed in the sense that the resulting pose estimate will also
be normal. This is convenient for coarse-fine, as discussed in Section 6.4. If little is
known about the pose a-priori, the prior may be made quite broad. This is expected
to be often the case. If nothing is known about the pose beforehand, the pose prior
may be left out. In that case the resulting criterion for evaluating hypotheses will be
based on Maximum Likelihood for pose, and on MAP for correspondences.

Assuming independence of the correspondences and the pose (before the image is
compared to the object model), a mixed joint probability function may be written as
follows:

\[
p(\Gamma, \beta) = G_{\psi_\beta}(\beta - \beta_0)
\prod_{i : \Gamma_i = \perp} B_i
\prod_{i : \Gamma_i \neq \perp} \frac{1 - B_i}{m} .
\]

This is a good assumption when view-based approaches to object modeling are used
(these are discussed in Chapter 4 and used in the experiments described in Chapter
10). (With general 3D rotation it is inaccurate, as the visibility of features depends
on the orientation of the object.) This probability function on match and pose is now
used with Bayes' rule as a prior for obtaining the posterior probability of $\Gamma$ and $\beta$:


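\[
p(\Gamma, \beta \mid Y) = \frac{p(Y \mid \Gamma, \beta) \, p(\Gamma, \beta)}{p(Y)}
\qquad (6.4)
\]

where $p(Y) = \sum_\Gamma \int d\beta \; p(Y \mid \Gamma, \beta) \, p(\Gamma, \beta)$ is a normalization factor that is formally
the probability of the image. It is a constant with respect to $\Gamma$ and $\beta$, the parameters
being estimated.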

The MAP strategy is used to obtain estimates of the correspondences and pose
by maximizing their posterior probability with respect to $\Gamma$ and $\beta$, as follows:

\[ \hat{\Gamma}, \hat{\beta} = \arg\max_{\Gamma, \beta} \; p(\Gamma, \beta \mid Y) . \]

For convenience, an objective function, $L$, is introduced that is a scaled logarithm
of $p(\Gamma, \beta \mid Y)$. The same estimates will result if the maximization is instead carried
out over $L$:

\[ \hat{\Gamma}, \hat{\beta} = \arg\max_{\Gamma, \beta} \; L(\Gamma, \beta) \]

where

\[ L(\Gamma, \beta) \equiv \ln \left( \frac{p(\Gamma, \beta \mid Y)}{C} \right) . \qquad (6.5) \]

The denominator in Equation 6.5 is a constant that has been chosen to cancel
constants from the numerator. Its value, which is independent of $\Gamma$ and $\beta$, is

\[
C = \frac{B_1 B_2 \cdots B_n}{(W_1 W_2 \cdots W_v)^n} \;
\frac{(2\pi)^{-\frac{z}{2}} \, |\psi_\beta|^{-\frac{1}{2}}}{p(Y)} .
\]


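After some manipulation the objective function may be expressed as

\[
L(\Gamma, \beta) = -\tfrac{1}{2} (\beta - \beta_0)^T \psi_\beta^{-1} (\beta - \beta_0)
+ \sum_{ij : \Gamma_i = M_j} \left[ \lambda_{ij} - \tfrac{1}{2} (Y_i - \mathcal{P}(M_j, \beta))^T \psi_{ij}^{-1} (Y_i - \mathcal{P}(M_j, \beta)) \right] ,
\qquad (6.6)
\]

where

\[
\lambda_{ij} \equiv \ln \left( \frac{1}{(2\pi)^{\frac{v}{2}}} \; \frac{1 - B_i}{B_i} \; \frac{W_1 W_2 \cdots W_v}{m \, |\psi_{ij}|^{\frac{1}{2}}} \right) .
\qquad (6.7)
\]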

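When a linear projection model is used, $\mathcal{P}(M_j, \beta) = M_j \beta$. (Linear projection
models were discussed in Chapter 5.) In this case, the objective function takes the
following simple form:

\[
L(\Gamma, \beta) = -\tfrac{1}{2} (\beta - \beta_0)^T \psi_\beta^{-1} (\beta - \beta_0)
+ \sum_{ij : \Gamma_i = M_j} \left[ \lambda_{ij} - \tfrac{1}{2} (Y_i - M_j \beta)^T \psi_{ij}^{-1} (Y_i - M_j \beta) \right] .
\qquad (6.8)
\]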

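When the background probability is constant, and when the feature covariance
matrix determinant is constant (as when oriented stationary statistics are used), the
formulas simplify further:

\[
\lambda = \ln \left( \frac{1}{(2\pi)^{\frac{v}{2}} \, m} \; \frac{1 - B}{B} \; \frac{W_1 W_2 \cdots W_v}{|\hat{\psi}|^{\frac{1}{2}}} \right) ,
\qquad (6.9)
\]

and

\[
L(\Gamma, \beta) = -\tfrac{1}{2} (\beta - \beta_0)^T \psi_\beta^{-1} (\beta - \beta_0)
+ \sum_{ij : \Gamma_i = M_j} \left[ \lambda - \tfrac{1}{2} (Y_i - M_j \beta)^T \psi_i^{-1} (Y_i - M_j \beta) \right] .
\qquad (6.10)
\]

Here, $\hat{\psi}$ is the stationary feature covariance matrix, and $\psi_i$ is the specialized
feature covariance matrix. These were discussed in Section 3.3.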




The first term of the objective function of Equation 6.8 expresses the influence of
the prior on the pose. As discussed above, when a useful pose prior isn't available,
this term may be dropped.

The second term has a simple interpretation. It is a sum taken over those image
features that are matched to object model features. The $\lambda_{ij}$ are fixed rewards for
making correspondences, while the quadratic forms are penalties for deviations of
observed image features from their expected positions in the image. Thus the objective
function evaluates the amount of the image explained in terms of the object, with
penalties for mismatch. This objective function is particularly simple in terms of $\beta$.
When $\Gamma$ is constant, $\beta$ and its (posterior) covariance are estimated by weighted least
squares. When using an algorithm based on search in correspondence space, the
estimate of $\beta$ can be cheaply updated by using the techniques of sequential parameter
estimation. (See Beck and Arnold [4].) The $\lambda_{ij}$ describe the relative value of a match
component or extension in a way that allows direct comparison to the entailed
mismatch penalty. The values of these trade-off parameters are supplied by the theory
(in Equation 6.7) and are given in terms of measurable domain statistics.
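The reward-minus-penalty structure of the objective function can be made concrete with a short sketch. It assumes a linear projection model, a single covariance `psi` shared by all features, and a constant background probability `B`; the function names and the dictionary representation of the match (image index to model index) are choices made for this example only.

```python
import numpy as np

def match_reward(B, m, extents, psi):
    """Correspondence reward (lambda) for constant B and shared covariance psi."""
    v = len(extents)
    _, logdet = np.linalg.slogdet(psi)
    return (np.log(1 - B) - np.log(B) + np.sum(np.log(extents))
            - np.log(m) - 0.5 * v * np.log(2 * np.pi) - 0.5 * logdet)

def objective(matches, Y, M, beta, psi, B, extents, beta0=None, psi_beta=None):
    """L(Gamma, beta) for a linear projection model.

    `matches` maps image feature index i to model feature index j;
    image features absent from `matches` are assigned to the background.
    """
    total = 0.0
    if beta0 is not None:                     # optional pose prior term
        d = beta - beta0
        total -= 0.5 * d @ np.linalg.solve(psi_beta, d)
    lam = match_reward(B, len(M), extents, psi)
    for i, j in matches.items():
        r = Y[i] - M[j] @ beta                # deviation from predicted position
        total += lam - 0.5 * r @ np.linalg.solve(psi, r)
    return total
```

With no matches and no prior the value is zero, and each perfectly predicted match adds exactly one reward, which is the behavior the interpretation above describes.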

The form of the objective function suggests an optimization strategy: make
correspondences to object features in order to accumulate correspondence rewards while
avoiding penalties for mismatch. It is important that the $\lambda_{ij}$ be positive, otherwise a
winning strategy is to make no matches to the object at all. This condition defines
a critical level of image clutter, beyond which the MAP criterion assigns the feature to
the background. $\lambda_{ij}$ describes the dependence of the value of matches on the amount
of background clutter. If background features are scarce, then correspondences to
object features become more important.
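The critical clutter level can be made explicit by rearranging the definition of the correspondence reward in Equation 6.7. The following short derivation is a sketch; the symbol $\kappa_{ij}$ is introduced here only for convenience and does not appear elsewhere in the text.

```latex
\lambda_{ij} > 0
\;\iff\; \frac{1 - B_i}{B_i} > \kappa_{ij}
\;\iff\; B_i < \frac{1}{1 + \kappa_{ij}} ,
\qquad
\kappa_{ij} \equiv \frac{m \, (2\pi)^{\frac{v}{2}} \, |\psi_{ij}|^{\frac{1}{2}}}{W_1 W_2 \cdots W_v} .
```

Read this way, matches are worth making only while the background probability stays below a bound set by the number of model features, the feature covariance, and the extent of the feature coordinate space.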

This objective function provides a simple and uniform way to evaluate match
and pose hypotheses. It captures important aspects of recognition: the amount of
image explained in terms of the object, as well as the metrical consistency of the
hypothesis; and it trades them off in a rational way based on domain statistics. Most
previous approaches have not made use of both criteria simultaneously in evaluating
hypotheses, thereby losing some robustness.

6.1.1  Using  the  Markov  Correspondence  Model 

When the Markov correspondence model of Section 2.3 is used instead of the
independent correspondence model, the functional form of the objective function of Equation
6.6 remains essentially unchanged, aside from gaining a new term that captures the
influence of the interaction of neighboring features. The names of some of the
constants change, reflecting the difference between Equations 2.2 and 2.4. Noting that
$p(\Gamma, \beta \mid Y)$ is linear in $p(\Gamma)$, it can be seen that the new term in the logarithmic
objective function will be:

\[ \sum_{i=1}^{n-1} \ln r_i(\Gamma_i, \Gamma_{i+1}) . \]

As before, when an algorithm based on search in correspondence space is used, the
estimate of $\beta$ can still be cheaply updated. A change in an element of correspondence,
some $\Gamma_i$, will now additionally entail the update of two of the terms in the expression
above.


6.2  Experimental  Implementation 

In this section an experiment demonstrating the use of the MMM objective function
is described. The intent is to demonstrate the utility of the objective function in a
domain of features that have significant fluctuations. The features are derived from
real images. The domain is matching among features from low-resolution edge images.
The point-radius feature model discussed in Section 5.3 is used. Oriented stationary
statistics, as described in Section 3.3, are used to model the feature fluctuations, so
that $\lambda_{ij} = \lambda_i$.
6.2.1 Search in Correspondence Space

Good solutions of the objective function of Equation 6.8 are sought by a search in
correspondence space. Search over the whole exponential space is avoided by heuristic
pruning.

An objective function that evaluates a configuration of correspondences, or match
(described by $\Gamma$), may be obtained as follows:

\[ \mathcal{L}(\Gamma) \equiv \max_\beta \; L(\Gamma, \beta) . \]

This optimization is quadratic in $\beta$ and is carried out by least squares. Sequential
techniques are used so that the cost of extending a partial match by one
correspondence is $O(1)$.
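One way to obtain a constant cost per extension is to accumulate the normal equations, so that adding a correspondence touches only fixed-size quantities. The sketch below uses this information-form accumulation; it is an assumption of this example and is not necessarily the sequential estimator of Beck and Arnold [4] used in the experiments. The class name and its methods are hypothetical, and a pose prior is required so that the system is always solvable.

```python
import numpy as np

class SequentialPose:
    """Accumulates the quadratic form of L(Gamma, beta) for a growing match."""

    def __init__(self, A, b, c, reward=0.0):
        self.A, self.b, self.c, self.reward = A, b, c, reward

    @classmethod
    def from_prior(cls, beta0, psi_beta):
        info = np.linalg.inv(psi_beta)
        return cls(info, info @ beta0, beta0 @ info @ beta0)

    def extended(self, M_j, Y_i, psi_inv, lam):
        """New accumulator with one more correspondence between Y_i and M_j."""
        return SequentialPose(self.A + M_j.T @ psi_inv @ M_j,
                              self.b + M_j.T @ psi_inv @ Y_i,
                              self.c + Y_i @ psi_inv @ Y_i,
                              self.reward + lam)

    def pose(self):
        """Pose maximizing L for the current match (weighted least squares)."""
        return np.linalg.solve(self.A, self.b)

    def score(self):
        """Maximum over beta of L(Gamma, beta) for the current match."""
        return self.reward - 0.5 * (self.c - self.b @ self.pose())
```

The work per extension depends on the pose dimension and the feature dimension but not on how many correspondences the partial match already contains.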

The space of correspondences may be organized as a directed-acyclic-graph (DAG)
by the following parent-child relation on matches. A point in correspondence space,
or match, is a child of another match if there is some $i$ such that $\Gamma_i = \perp$ in the parent,
and $\Gamma_i = M_j$, for some $j$, in the child, and they are otherwise the same. Thus, the
child has one more assignment to the model than the parent does. This DAG is rooted
in the match where all assignments are to the background. All possible matches are
reachable from the root. A fragment of an example DAG of this kind is illustrated
in Figure 6-1. Components of matches that are not explicit in the figure are assigned
to the background.

Heuristic beam search, as described in [64], is used to search over matches for good
solutions of $\mathcal{L}$. Success depends on the heuristic that there aren't many impostors in
the image. An impostor is a set of image features that scores well but isn't a subset
of the optimum match implied by the objective function. Another way of stating the
heuristic is that the best match to $n + 1$ object features is likely to contain the best
match to $n$ object features.

Figure 6-1: Fragment of Correspondence Space DAG

The search method used in the experiments employs a bootstrapping mechanism
based on distinguished features. Object features 1, 2 and 3 are special, and must
be detected. The scheme could be made robust by considering more initial triples
of object features. Alternatively, indexing methods could be used as an efficient and
robust means to initiate the search. Indexing methods are described by Clemens and
Jacobs [19], and in Section 9.1.

The  algorithm  that  was  used  is  outlined  below. 

BEAM-SEARCH(M, Y)

    CURRENT <- {Gamma : exactly one image feature is matched to each of M_1, M_2 and M_3,
                the rest are assigned to the background}
    Prune CURRENT according to L. Keep 50 best.
    Iterate to Fixpoint:
        Add to CURRENT all children of members of CURRENT
        Prune CURRENT according to L. Keep N best.
        ;; N is reduced from 20 to 5 as the search proceeds.
    Return(CURRENT)

Here L denotes the match objective function $\mathcal{L}$ defined above.

Figure 6-2: Images used for Matching

Sometimes an extension of a match will produce one that is already in CURRENT,
that was reached in a different sequence of extensions. When this happens,
the matches are coalesced. This condition is efficiently detected by testing for near
equality of the scores of the items in CURRENT. Because the features are derived from
observations containing some random noise, it is very unlikely that two hypotheses
having differing matches will achieve the same score, since the score is partly based
on summed squared errors.
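The outline above can be sketched in runnable form on a toy problem. Everything in this example is an assumption made for illustration: the features are one-dimensional, the pose is a single translation, the reward is fixed at 1, the beam width schedule (20, 10, 5) is one possible reading of "reduced from 20 to 5", and duplicate matches are coalesced by set equality rather than by comparing scores.

```python
from itertools import permutations

def beam_search(seeds, children, score, first_width=50, widths=(20, 10, 5)):
    """Beam search over matches; a match is a frozenset of (image, model) pairs."""
    def prune(matches, width):
        # A set coalesces matches reached by different sequences of extensions;
        # the secondary sort key makes the ranking deterministic under ties.
        ranked = sorted(set(matches), key=lambda g: (score(g), sorted(g)), reverse=True)
        return ranked[:width]

    current = prune(seeds, first_width)
    step = 0
    while True:
        width = widths[min(step, len(widths) - 1)]
        extended = [c for match in current for c in children(match)]
        updated = prune(current + extended, width)
        if updated == current:              # fixpoint: pruning changes nothing
            return current
        current, step = updated, step + 1

# Toy 1D domain: the image is the model shifted by 10, plus one clutter feature.
MODEL = [0.0, 1.0, 3.0, 6.0]
IMAGE = [10.0, 11.0, 13.0, 16.0, 4.5]
REWARD = 1.0

def score(match):
    """Reward minus squared-error penalty, maximized over the translation."""
    residuals = [IMAGE[i] - MODEL[j] for i, j in sorted(match)]
    shift = sum(residuals) / len(residuals)
    return sum(REWARD - 0.5 * (r - shift) ** 2 for r in residuals)

def children(match):
    """Matches with one more image feature assigned to some model feature."""
    used = {i for i, _ in match}
    return [match | {(i, j)} for i in range(len(IMAGE)) if i not in used
            for j in range(len(MODEL))]

# Bootstrap: one image feature matched to each of the first three model features.
seeds = [frozenset(zip(triple, range(3))) for triple in permutations(range(len(IMAGE)), 3)]
best = beam_search(seeds, children, score)[0]
```

In this toy domain the highest-scoring match pairs image features 0 to 3 with model features 0 to 3 and leaves the clutter feature in the background.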

6.2.2  Example  Search  Results 

The  search  method  described  in  the  previous  section  was  used  to  obtain  good  matches 
in  a  domain  of  features  that  have  significant  fluctuations.  The  features  were  derived 
from  real  images.  A  linear  projection  model  was  used. 

Images used for matching are shown in Figure 6-2. The object model was derived
from a set of 16 images, of which the image on the left is an example. In this set, only
the light source position varied. The image features used in the search were derived
from the image on the right.

The features used for matching were derived from the edge maps shown in Figure
6-3. The image on the left shows the object model edges and the image on the right
shows the image edges. These edges are from the Canny edge detector [13]. The
smoothing standard deviation is eight pixels; these are low resolution edge maps.
The object model edges were derived from a set of 16 edge maps, corresponding to the
16 images described above. The object model edges are essentially the mean edges
with respect to fluctuations induced by variations in lighting. (Low resolution edges
are sensitive to lighting.) They are similar to the Mean Edge Images described in
Section 4.4.

Figure 6-3: Edge Maps used for Matching

The features used in matching are shown in Figure 6-4. These are point-radius
features, as described in Section 5.3. The point coordinates of the features are
indicated by dots, while the normal vector and curvature are illustrated by arc fragments.
Each feature represents 30 edge pixels. The 40 object features appear in the upper
picture, the 126 image features in the lower picture. The distinguished features used in the
bootstrap of the search are indicated with circles. The object features have been
transformed to a new pose to ensure generality.

The parameters that appear in the objective function are $B$, the background
probability, and $\hat{\psi}$, the stationary feature covariance. These were derived from a
match done by hand in the example domain. The oriented stationary statistics model
of Section 3.3 was used here. (A normal model of feature fluctuations is implicit in
the objective function of Equation 6.8. This was found to be a good model in this
domain.)

A loose pose prior was used. This pose prior is illustrated in Figure 6-5. The prior
places the object in the upper left corner of the image. The one standard deviation
intervals of position and angle are illustrated. The one standard deviation variation of
scale is 30 percent. The actual pose of the object is within the indicated one standard
deviation bounds. This prior was chosen to demonstrate that the method works well
despite a loose pose prior.

Figure 6-5: Pose Prior used in Search

The best results of the beam search appear in Figure 6-6. In the upper image,
the object features are delineated with heavy lines. They are located according to
the pose associated with the best match. In the lower image, the object features and
image features are illustrated, while the 18 correspondences associated with the best
match appear as heavy lines and dots.

The object features located according to the poses associated with the five best
matches are seen in Figure 6-7. The results are difficult to distinguish because the
poses are very similar.

6.3  Search  in  Pose  Space 

This section will explore searching the MMM objective function in pose space.
Connections to robust chamfer matching will be described.

A pose estimate is sought by ordering the search for maxima of the MMM objective
function as follows:

\[ \hat{\beta} = \arg\max_\beta \max_\Gamma \; L(\Gamma, \beta) . \]

Substituting the objective function from Equation 6.6 yields

\[
\hat{\beta} = \arg\max_\beta \max_\Gamma \sum_{ij : \Gamma_i = M_j}
\left[ \lambda_{ij} - \tfrac{1}{2} (Y_i - \mathcal{P}(M_j, \beta))^T \psi_{ij}^{-1} (Y_i - \mathcal{P}(M_j, \beta)) \right] .
\]

The pose prior term has been dropped in the interest of clarity. It would be easily
retained as an additional quadratic term.

This equation may be simplified with the following definition:

\[ D_{ij}(x) \equiv \tfrac{1}{2} x^T \psi_{ij}^{-1} x . \]

$D_{ij}(x)$ may be thought of as a generalized squared distance between observed and
predicted features. It has been called the squared Mahalanobis distance [22].

The pose estimator may now be written as

\[ \hat{\beta} = \arg\max_\beta \max_\Gamma \sum_{ij : \Gamma_i = M_j} \left[ \lambda_{ij} - D_{ij}(Y_i - \mathcal{P}(M_j, \beta)) \right] , \]

or equivalently, as a minimization rather than a maximization,

\[ \hat{\beta} = \arg\min_\beta \min_\Gamma \sum_{ij : \Gamma_i = M_j} \left[ D_{ij}(Y_i - \mathcal{P}(M_j, \beta)) - \lambda_{ij} \right] . \]

The sum is taken over those image features that are assigned to model features
(not the background) in the match. It may be re-written in the following way,

\[
\hat{\beta} = \arg\min_\beta \sum_i \min_{\Gamma_i}
\begin{cases}
0 & \text{if } \Gamma_i = \perp \\
D_{ij}(Y_i - \mathcal{P}(M_j, \beta)) - \lambda_{ij} & \text{if } \Gamma_i = M_j
\end{cases}
\]

or as

\[ \hat{\beta} = \arg\min_\beta \sum_i \min\left( 0, \; \min_j \left[ D_{ij}(Y_i - \mathcal{P}(M_j, \beta)) - \lambda_{ij} \right] \right) . \]

If the correspondence reward is independent of the model feature (this holds when
oriented stationary statistics are used), $\lambda_{ij} = \lambda_i$. In this case, $\lambda_i$ may be added to
each term in the sum without affecting the minimizing pose, yielding the following
form for the pose estimator:

\[ \hat{\beta} = \arg\min_\beta \sum_i \min\left( \lambda_i, \; \min_j D_{ij}(Y_i - \mathcal{P}(M_j, \beta)) \right) . \qquad (6.11) \]

Figure 6-7: Best Five Match Results

This objective function is easily interpreted: it is the sum, taken over image features, of a saturated penalty. The penalty (before saturation) is the smallest generalized squared distance from the observed image feature to some projected model feature. The penalty $\min_j D_{ij}(x - P(M_j, \beta))$ has the form of a Voronoi surface, as described by Huttenlocher et al. [42]. They describe a measure of similarity on image patterns, the Hausdorff distance, that is the upper envelope (maximum) of Voronoi surfaces. The measure used here differs in being saturated, and in using the sum of Voronoi surfaces rather than the upper envelope. In their work, the upper envelope offers some reduction in the complexity of the measure, and facilitates the use of methods of computational geometry for explicitly computing the measure in 2 and 3 dimensional spaces.

Computational geometry methods might be useful for computing the objective function of Equation 6.11. In higher dimensional pose spaces (4 or 6, for example) KD-tree methods may be the only such techniques currently available. Breuel has used KD-tree search algorithms in feature matching.

Next a connection will be shown between MMM search in pose space and a method of robust chamfer matching. First, the domain of MMM is simplified in the following way. Full stationarity of feature fluctuations is assumed (as covered in Chapter 3). Further, the feature covariance is assumed to be isotropic. With these assumptions we have $\psi_{ij} = \sigma^2 I$, and $D_{ij}(x) = \frac{1}{2\sigma^2} |x|^2$. Additionally, assuming constant background probability, we have $\lambda_i = \lambda$. The pose estimator of Equation 6.11 may now be written in the following simplified form.

$$\hat\beta = \arg\min_\beta \sum_i \min\left( \lambda, \min_j \frac{1}{2\sigma^2} |Y_i - P(M_j, \beta)|^2 \right) .$$

When the projection function is linear, invertible, and distance preserving (2D and 3D rigid transformations satisfy these properties), the estimator may be expressed as follows.

$$\hat\beta = \arg\min_\beta \sum_i \min\left( \lambda, \min_j \frac{1}{2\sigma^2} |P^{-1}(Y_i, \beta) - M_j|^2 \right) .$$

This may be further simplified to

$$\hat\beta = \arg\min_\beta \sum_i \min\left( \lambda, d^2(P^{-1}(Y_i, \beta)) \right) , \qquad (6.12)$$

by using the following definition of a minimum distance function.

$$d(x) \equiv \frac{1}{\sqrt{2}\,\sigma} \min_j |x - M_j| . \qquad (6.13)$$

Chamfering methods may be used to tabulate approximations of $d^2(x)$ in an image-like array that is indexed by pixel coordinates. Chamfer-based approaches to image registration problems use the array to facilitate fast evaluation of pose objective functions. Barrow et al. [.f] describe an early method where the objective function is the sum over model features of the distance from the projected model feature to the nearest image feature. Borgefors [8] recommends the use of RMS distance rather than summed distance in the objective function.

Recently, Jiang et al. [47] described a method of robust chamfer matching. In order to make the method less susceptible to disturbance by outliers and occlusions, they added saturation to the RMS objective function of Borgefors. Their objective function has the following form

$$\frac{1}{3} \left( \frac{1}{n} \sum_j \min(t^2, d_j^2) \right)^{1/2}$$

where $d_j^2$ is the squared distance from the j'th projected model point to the nearest image point, and $t$ is the saturation threshold. Aside from the constants and square root, which do not affect the minimizing pose, this objective function is equivalent to Equation 6.12 if the role of image and model features is reversed, and the sense of the projection function is inverted. Jiang et al. show impressive results using robust chamfer matching to register multi-modal 3D medical imagery.
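
The saturated objective of Equations 6.12 and 6.13 can be illustrated with a small sketch. It is not the implementation used in this work: it assumes 2D point features, a translation-only pose (so that the inverse projection is a subtraction), and a brute-force nearest-feature search in place of a chamfer table. The function name and the test data are hypothetical.

```python
def saturated_chamfer_objective(beta, image_feats, model_feats, sigma, lam):
    """Sum over image features of min(lam, d^2), as in Equations 6.12 and 6.13.

    Assumption: the pose beta is a 2D translation, so P^{-1}(Y_i, beta) = Y_i - beta.
    """
    total = 0.0
    for (yx, yy) in image_feats:
        # Map the image feature back into model coordinates.
        px, py = yx - beta[0], yy - beta[1]
        # d^2(x): scaled squared distance to the nearest model feature.
        d2 = min((px - mx) ** 2 + (py - my) ** 2 for (mx, my) in model_feats)
        d2 /= 2.0 * sigma ** 2
        # Saturation: an image feature never costs more than lam.
        total += min(lam, d2)
    return total


model = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_pose = (5.0, 3.0)
image = [(mx + true_pose[0], my + true_pose[1]) for (mx, my) in model]
image.append((50.0, 50.0))  # one outlier, far from every projected model feature

at_truth = saturated_chamfer_objective(true_pose, image, model, sigma=1.0, lam=2.0)
shifted = saturated_chamfer_objective((6.0, 3.0), image, model, sigma=1.0, lam=2.0)
```

At the true pose the three object features contribute nothing and the outlier contributes only the saturation value, so a single distant outlier cannot dominate the sum.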

6.4  Extensions 

MAP Model Matching performs well on low resolution imagery in which feature uncertainty is significant. It could be used to bootstrap a coarse-fine approach to model matching, yielding good results with reasonable running times. Coarse-fine approaches have proven successful in stereo matching applications. (See Grimson [33] and Barnard [2].) A coarse-fine strategy is straightforward in the framework described here. In a hierarchy, the pose estimate from solving the objective function at one scale is used as a prior for the estimation at the next. Having a good prior on the pose will greatly reduce the amount of searching required at high resolution.

Finding a tractable model that incorporates pose dependent visibility conditions would be useful for applying MMM in non view-based recognition.

6.5  Related  Work 

The HYPER vision system of Ayache and Faugeras [1] uses sequential linear-least-squares pose estimation as well as the linear 2D point feature and projection model described in Section 5.2. HYPER is described as a search algorithm. Different criteria are used to evaluate candidate matches and to evaluate competing "whole" hypotheses. An ad hoc threshold is used for testing a continuous measure of the metrical consistency of candidate match extensions. Whole match hypotheses are evaluated according to the amount of image feature accounted for, although not according to overall metrical consistency. HYPER works well on real images of industrial parts.

Goad outlined a Bayesian strategy of match evaluation based on feature and background statistics in his paper on automatic programming for model-based vision [29]. In his system, search was controlled by thresholds on probabilistic measures of the reliability and plausibility of matches.

Lowe  describes  in  general  terms  the  application  of  Bayesian  techniques  in  his  book 
on  Visual  Recognition  [51].  He  treats  the  minimization  of  expected  running  time  of 
recognition.  In  addition  he  discusses  selection  among  numerous  objects. 

Object recognition matching systems often use a strategy that can be summarized as a search for the maximal matching that is consistent. Consistency is frequently defined to mean that the matching image feature is within finite bounds of its expected position (bounded error models). Cass' system [14] is one example. Such an approach may be cast in the framework defined here by assuming uniform probability density functions for the feature deviations. Pose solution with this approach is likely to be more complicated than the sequential linear-least-squares method that can be used when feature deviations have normal models. Cass' approach effectively finds the global optimum of its objective function. It performs well on occluded or fragmented real images.

Beveridge, Weiss and Riseman [6] use an objective function for line segment based recognition that is similar to the one described here. In their work, the penalty for deviations is quadratic, while the reward for correspondence is non-linear (exponential) in the amount of missing segment length. (By contrast, the reward described in this paper is, for stationary models, linear in the length of aggregate features.) The trade-off parameters in their objective function were determined empirically. Their system gives good performance in a domain of real images.

Burns  and  Riseman  [12]  and  Burns  [11]  describe  a  classification  based  recognition 
system.  They  focus  on  the  use  of  description  networks  for  efficiently  searching  among 
multiple  objects  with  a  recursive  indexing  scheme. 

Hanson and Fua [27] [26] describe a general objective function approach to image understanding. They use a minimum description length (MDL) criterion that is designed to work with generic object models. The approach presented here is tailored for specific object models.


6.6  Summary 

A MAP model matching technique for visual object recognition has been described.
The  resulting  objective  function  has  a  simple  form  when  normal  feature  deviation 
models  and  linear  projection  models  are  used.  Experimental  results  were  shown 
indicating  that  MAP  Model  Matching  works  well  in  low  resolution  matching,  where 
feature  deviations  are  significant.  Related  work  was  discussed. 


Chapter  7 


Posterior  Marginal  Pose 
Estimation 


In the previous chapter on MAP Model Matching the object recognition problem was posed as an optimization problem resulting from a statistical theory. In that formulation, complete hypotheses consist of a description of the correspondences between image and object features, as well as the pose of the object. The method was shown to provide effective evaluations of match and pose.

The formulation of recognition that is described in this chapter, Posterior Marginal Pose Estimation¹ (PMPE), builds on MAP Model Matching. It provides a smooth objective function for evaluating the pose of the object, without commitment to a particular match. The pose is the most important aspect of the problem, in the sense that knowing the pose enables grasping or other interaction with the object.

In this chapter, the objective function is explored by probing in selected parts of pose space. The domain of these experiments is features derived from synthetic laser radar range imagery, and grayscale video imagery. A limited pose space search is performed in the video experiment.

In Chapter 8 the Expectation-Maximization (EM) algorithm is discussed as a means of searching for local maxima of the objective function in pose space.

Additional experiments in object recognition using the PMPE objective function are described in Chapter 10. There, the EM algorithm is used in conjunction with an indexing method that generates initial hypotheses.

¹ An early version of this work appeared in [76].


7.1  Objective  Function  for  Pose 

The following method was motivated by the observation that in heuristic searches over correspondences with the objective function of MAP Model Matching, hypotheses having implausible matches scored poorly in the objective function. The implication was that summing posterior probability over all the matches (at a specific pose) might provide a good pose evaluator. This has proven to be the case. Although this might intuitively seem like an odd way to evaluate a pose, it is at least democratic in that all poses are evaluated in the same way. The resulting pose estimator is smooth, and is amenable to local search in pose space. It is not tied to specific matches; it is perhaps in keeping with Marr's recommendation that computational theories of vision should try to satisfy a principle of least commitment [52].

Additional motivation was provided by the work of Yuille, Geiger and Bülthoff on stereo [78]. They discussed computing disparities in a statistical theory of stereo where a marginal is computed over matches.

In MAP Model Matching, joint hypotheses of match and pose were evaluated by their posterior probability, given an image: $p(\Gamma, \beta \mid Y)$. $\Gamma$ and $\beta$ stand for correspondences and pose, respectively, and $Y$ for the image features. The posterior probability was built from specific models of features and correspondences, objects, and projection that were described in the previous chapters. The present formulation will first be described using the independent correspondence model. Use of the Markov correspondence model will be described in the following section.

Here we use the same strategy for evaluating object poses: they are evaluated by their posterior probability, given an image: $p(\beta \mid Y)$. The posterior probability density of the pose may be computed from the joint posterior probability on pose and match, by formally taking the marginal over possible matches:

$$p(\beta \mid Y) = \sum_\Gamma p(\Gamma, \beta \mid Y) .$$

In Section 6.1, Equation 6.4, $p(\Gamma, \beta \mid Y)$ was obtained via Bayes' rule from probabilistic models of image features, correspondences, and the pose. Substituting for $p(\Gamma, \beta \mid Y)$, the posterior marginal may be written as

$$p(\beta \mid Y) = \sum_\Gamma \frac{p(Y \mid \Gamma, \beta)\, p(\Gamma, \beta)}{p(Y)} .$$

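Using Equations 2.1 (the independent feature model) and 6.2, we may express the posterior marginal of $\beta$ in terms of the component densities:

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \sum_{\Gamma_1} \sum_{\Gamma_2} \cdots \sum_{\Gamma_n} \prod_i \left[ p(Y_i \mid \Gamma_i, \beta)\, p(\Gamma_i) \right] .$$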

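Breaking one factor out of the product gives

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \sum_{\Gamma_1} \sum_{\Gamma_2} \cdots \sum_{\Gamma_n} \left[ \prod_{i=1}^{n-1} p(Y_i \mid \Gamma_i, \beta)\, p(\Gamma_i) \right] p(Y_n \mid \Gamma_n, \beta)\, p(\Gamma_n)$$

$$= \frac{p(\beta)}{p(Y)} \sum_{\Gamma_1} \sum_{\Gamma_2} \cdots \sum_{\Gamma_{n-1}} \left[ \prod_{i=1}^{n-1} p(Y_i \mid \Gamma_i, \beta)\, p(\Gamma_i) \right] \left[ \sum_{\Gamma_n} p(Y_n \mid \Gamma_n, \beta)\, p(\Gamma_n) \right] .$$

Continuing in similar fashion yields

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \prod_i \left[ \sum_{\Gamma_i} p(Y_i \mid \Gamma_i, \beta)\, p(\Gamma_i) \right] .$$

This may be written as

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \prod_i p(Y_i \mid \beta) , \qquad (7.2)$$

since

$$p(Y_i \mid \beta) = \sum_{\Gamma_i} p(Y_i \mid \Gamma_i, \beta)\, p(\Gamma_i) . \qquad (7.3)$$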

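Splitting the $\Gamma_i$ sum into its cases gives

$$p(Y_i \mid \beta) = p(Y_i \mid \Gamma_i = \perp, \beta)\, p(\Gamma_i = \perp) + \sum_{M_j} p(Y_i \mid \Gamma_i = M_j, \beta)\, p(\Gamma_i = M_j) .$$

Substituting the densities assumed in the model of Section 6.1 in Equations 6.1 and 2.2 then yields

$$p(Y_i \mid \beta) = \frac{B_i}{W_1 \cdots W_v} + \frac{1 - B_i}{m} \sum_{M_j} G_{\psi_{ij}}(Y_i - P(M_j, \beta)) . \qquad (7.4)$$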

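Installing this into Equation 7.2 leads to

$$p(\beta \mid Y) = \frac{B_1 B_2 \cdots B_n}{(W_1 \cdots W_v)^n} \frac{p(\beta)}{p(Y)} \prod_i \left[ 1 + \sum_{M_j} \frac{W_1 \cdots W_v}{m} \frac{1 - B_i}{B_i} G_{\psi_{ij}}(Y_i - P(M_j, \beta)) \right] .$$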

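As in Section 6.1 the objective function for Posterior Marginal Pose Estimation is defined as the scaled logarithm of the posterior marginal probability of the pose,

$$L(\beta) \equiv \ln \left( \frac{p(\beta \mid Y)}{C} \right) ,$$

where, as before, $C$ gathers the factors that do not depend on the pose: the leading constant of the preceding expression, the normalization constant of the pose prior, and $1/p(Y)$.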

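This leads to the following expression for the objective function (use of a normal pose prior is assumed)

$$L(\beta) = -\frac{1}{2} (\beta - \beta_0)^T \psi_\beta^{-1} (\beta - \beta_0) + \sum_i \ln \left[ 1 + \sum_j \frac{W_1 \cdots W_v}{m} \frac{1 - B_i}{B_i} G_{\psi_{ij}}(Y_i - P(M_j, \beta)) \right] . \qquad (7.5)$$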


This objective function for evaluating pose hypotheses is a smooth function of the pose. Methods of continuous optimization may be used to search for local maxima, although starting values are an issue.

The first term in the PMPE objective function (Equation 7.5) is due to the pose prior. It is a quadratic penalty for deviations from the nominal pose. The second term essentially measures the degree of alignment of the object model with the image. It is a sum taken over image features of a smooth non-linear function that peaks up positively when the pose brings object features into alignment with the image feature in question. The logarithmic term will be near zero if there are no model features close to the image feature in question.

In a straightforward implementation of the objective function, the cost of evaluating a pose is $O(mn)$, since it is essentially a non-linear double sum over image and model features.
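
The double sum can be made concrete with a minimal sketch of the alignment term. The following is not the implementation used in the experiments: it assumes 2D point features, a translation-only pose, an isotropic feature covariance $\sigma^2 I$, a single background rate shared by all image features, and it drops the pose prior. The parameter `image_area` is a hypothetical stand-in for the product of the image dimensions.

```python
import math


def pmpe_objective(beta, image_feats, model_feats, sigma, background_rate, image_area):
    """Alignment term of a PMPE-style objective: sum_i ln(1 + weight * sum_j G(...)).

    Assumptions (not from the source): translation-only pose, isotropic 2D
    Gaussian feature model, constant background rate, no pose prior.
    """
    m = len(model_feats)
    weight = (image_area / m) * (1.0 - background_rate) / background_rate
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)  # 2D isotropic Gaussian normalizer
    total = 0.0
    for (yx, yy) in image_feats:          # outer sum: n image features
        s = 0.0
        for (mx, my) in model_feats:      # inner sum: m model features
            dx = yx - (mx + beta[0])
            dy = yy - (my + beta[1])
            s += norm * math.exp(-0.5 * (dx * dx + dy * dy) / sigma ** 2)
        total += math.log1p(weight * s)
    return total


model = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
image = [(mx + 5.0, my + 3.0) for (mx, my) in model] + [(50.0, 50.0)]
args = dict(image_feats=image, model_feats=model, sigma=1.0,
            background_rate=0.5, image_area=256.0 * 256.0)
aligned = pmpe_objective((5.0, 3.0), **args)
shifted = pmpe_objective((8.0, 3.0), **args)
far = pmpe_objective((500.0, 500.0), **args)
```

With these toy inputs the value is largest at the aligned pose, smaller at the shifted pose, and zero when no projected model feature lies near any image feature, which matches the qualitative description of the logarithmic term.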


7.2  Using  the  Markov  Correspondence  Model 


When the Markov Correspondence model of Section 2.3 is used instead of the independent correspondence model, the summing techniques of the previous section no longer apply. Because of this, a computationally attractive closed form formula for the posterior probability no longer obtains. Nevertheless, it will be shown that the posterior probability at a pose can still be efficiently evaluated using dynamic programming.


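Referring to Equation 7.1, and using the independence of match and pose in the prior (discussed in Section 6.1), the posterior marginal probability of a pose may be written as follows.

$$p(\beta \mid Y) = \sum_\Gamma \frac{p(Y \mid \Gamma, \beta)\, p(\Gamma)\, p(\beta)}{p(Y)} .$$

Using Equations 2.3 and 6.1,

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \sum_{\Gamma_1 \Gamma_2 \ldots \Gamma_n} p(Y_1 \mid \Gamma_1, \beta) \cdots p(Y_n \mid \Gamma_n, \beta)\, q(\Gamma_1) q(\Gamma_2) \cdots q(\Gamma_n)\, r_1(\Gamma_1, \Gamma_2) \cdots r_{n-1}(\Gamma_{n-1}, \Gamma_n) .$$

This may be re-written as follows.

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \sum_{\Gamma_1 \Gamma_2 \ldots \Gamma_n} \prod_{i=1}^{n} c_i(\Gamma_i) \prod_{i=1}^{n-1} r_i(\Gamma_i, \Gamma_{i+1}) , \qquad (7.6)$$

where

$$c_i(\Gamma_i) \equiv p(Y_i \mid \Gamma_i, \beta)\, q(\Gamma_i) .$$

Here, the dependence of $c$ on $\beta$ has been suppressed for notational brevity.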

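Next it will be shown that $p(\beta \mid Y)$ may be written using a recurrence relation:

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \sum_{\Gamma_n} h_{n-1}(\Gamma_n)\, c_n(\Gamma_n) , \qquad (7.7)$$

where

$$h_1(a) \equiv \sum_b c_1(b)\, r_1(b, a) \qquad (7.8)$$

and

$$h_{j+1}(a) \equiv \sum_b h_j(b)\, c_{j+1}(b)\, r_{j+1}(b, a) . \qquad (7.9)$$

Expanding Equation 7.7 in terms of the recurrence relation,

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \sum_{\Gamma_n} \left[ \sum_{\Gamma_{n-1}} h_{n-2}(\Gamma_{n-1})\, c_{n-1}(\Gamma_{n-1})\, r_{n-1}(\Gamma_{n-1}, \Gamma_n) \right] c_n(\Gamma_n) ,$$

or

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \sum_{\Gamma_{n-1} \Gamma_n} h_{n-2}(\Gamma_{n-1}) \prod_{i=n-1}^{n} c_i(\Gamma_i)\; r_{n-1}(\Gamma_{n-1}, \Gamma_n) .$$

Again using the recurrence relation,

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \sum_{\Gamma_{n-1} \Gamma_n} \left[ \sum_{\Gamma_{n-2}} h_{n-3}(\Gamma_{n-2})\, c_{n-2}(\Gamma_{n-2})\, r_{n-2}(\Gamma_{n-2}, \Gamma_{n-1}) \right] \prod_{i=n-1}^{n} c_i(\Gamma_i)\; r_{n-1}(\Gamma_{n-1}, \Gamma_n) ,$$

or

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \sum_{\Gamma_{n-2} \Gamma_{n-1} \Gamma_n} h_{n-3}(\Gamma_{n-2}) \prod_{i=n-2}^{n} c_i(\Gamma_i) \prod_{i=n-2}^{n-1} r_i(\Gamma_i, \Gamma_{i+1}) .$$

Continuing in similar fashion leads to

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \sum_{\Gamma_2 \Gamma_3 \ldots \Gamma_n} h_1(\Gamma_2) \prod_{i=2}^{n} c_i(\Gamma_i) \prod_{i=2}^{n-1} r_i(\Gamma_i, \Gamma_{i+1}) ,$$

and now using the base expression for $h_1(\cdot)$,

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \sum_{\Gamma_2 \Gamma_3 \ldots \Gamma_n} \left[ \sum_{\Gamma_1} c_1(\Gamma_1)\, r_1(\Gamma_1, \Gamma_2) \right] \prod_{i=2}^{n} c_i(\Gamma_i) \prod_{i=2}^{n-1} r_i(\Gamma_i, \Gamma_{i+1}) ,$$

or finally,

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \sum_{\Gamma_1 \Gamma_2 \ldots \Gamma_n} \prod_{i=1}^{n} c_i(\Gamma_i) \prod_{i=1}^{n-1} r_i(\Gamma_i, \Gamma_{i+1}) ,$$

which is the same as Equation 7.6. This completes the verification of Equation 7.7.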

Next, a dynamic programming algorithm will be described that efficiently evaluates an objective function that is proportional to the posterior marginal probability of a pose. The objective function is $\frac{p(Y)}{p(\beta)}\, p(\beta \mid Y)$. The algorithm is a direct implementation of the recurrence defined in Equations 7.7, 7.8 and 7.9, that builds a table of values of $h_i(\cdot)$ from the bottom up. Note that $h_i(b)$ only has two values, depending on whether $b = \perp$ or not. In the following description, the symbol $\top$ is used to stand for an anonymous model feature. H denotes array locations that store values of $h_i$, and H(·,·) is an access function, defined below, that accesses the stored values.

;;; Use Dynamic Programming to evaluate PMPE with Markov Correspondence Model.

EVALUATE-POSE(β)
    H[1,⊥] ← Σ_b C(1, b, β) r_1(b, ⊥)
    H[1,⊤] ← Σ_b C(1, b, β) r_1(b, ⊤)
    For i ← 2 To n − 1
        H[i,⊥] ← Σ_b H(i − 1, b) C(i, b, β) r_i(b, ⊥)
        H[i,⊤] ← Σ_b H(i − 1, b) C(i, b, β) r_i(b, ⊤)
    Return (Σ_b H(n − 1, b) C(n, b, β))

;;; Define the auxiliary function
C(i, b, β)
    Return (p(Y_i | b, β) q(b))

;;; Access values of H stored in a table.
H(a, b)
    If b = ⊥ Return (H[a,⊥])
    Else Return (H[a,⊤])

The loop in EVALUATE-POSE executes $O(n)$ times, and each time through the loop does $O(m)$ evaluations of the summands, so the complexity is $O(mn)$. This is the same complexity as a straightforward implementation of the PMPE objective function when the Markov model is not used (Equation 7.5).
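
The recurrence of Equations 7.7 to 7.9 can be sketched in a few lines and checked against the brute-force sum of Equation 7.6. This is an illustrative sketch, not the thesis implementation: the tables `c` and `r` hold arbitrary positive numbers standing in for the model terms, indices are zero-based, at least two image features are assumed, and the code does not exploit the fact that $h_i(b)$ takes only two values, so its cost is $O(nk^2)$ for $k$ possible assignments rather than $O(mn)$.

```python
import itertools
import math
import random


def evaluate_pose(c, r):
    """Evaluate sum_Gamma prod_i c_i(Gamma_i) prod_i r_i(Gamma_i, Gamma_{i+1}).

    c[i][b] stands for c_{i+1}(b) and r[i][b][a] for r_{i+1}(b, a) (zero-based
    storage of the one-based terms in the text). Requires n >= 2.
    """
    n = len(c)
    k = len(c[0])
    # Base case, Equation 7.8: h_1(a) = sum_b c_1(b) r_1(b, a).
    h = [sum(c[0][b] * r[0][b][a] for b in range(k)) for a in range(k)]
    # Recurrence, Equation 7.9: h_{j+1}(a) = sum_b h_j(b) c_{j+1}(b) r_{j+1}(b, a).
    for i in range(1, n - 1):
        h = [sum(h[b] * c[i][b] * r[i][b][a] for b in range(k)) for a in range(k)]
    # Final sum of Equation 7.7 (without the p(beta) / p(Y) factor).
    return sum(h[b] * c[n - 1][b] for b in range(k))


def brute_force(c, r):
    """Direct sum over all k**n assignments, as in Equation 7.6."""
    n, k = len(c), len(c[0])
    total = 0.0
    for g in itertools.product(range(k), repeat=n):
        term = math.prod(c[i][g[i]] for i in range(n))
        term *= math.prod(r[i][g[i]][g[i + 1]] for i in range(n - 1))
        total += term
    return total


rng = random.Random(1)
n, k = 5, 3
c = [[rng.random() for _ in range(k)] for _ in range(n)]
r = [[[rng.random() for _ in range(k)] for _ in range(k)] for _ in range(n - 1)]
dp_value = evaluate_pose(c, r)
direct_value = brute_force(c, r)
```

Agreement between the two values on random tables is a quick sanity check that the table is being filled in the order the recurrence requires.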

The summing technique used here was described by Cheeseman [17] in a paper about using maximum-entropy methods in expert systems.


7.3  Range  Image  Experiment 

An experiment investigating the utility of Posterior Marginal Pose Estimation is described in this section. Additional experiments are described in Chapter 10.

The  objective  function  of  Equation  7.5  was  sampled  in  a  domain  of  synthetic  range 
imagery.  The  feasibility  of  coarse-fine  search  methods  was  investigated  by  sampling 
smoothed  variants  of  the  objective  function. 

7.3.1  Preparation  of  Features 

The  preparation  of  the  features  used  in  the  experiment  is  summarized  in  Figure  7-1. 
The  features  were  oriented-range  features,  as  described  in  Section  5.4.  Two  sets  of 
features  were  prepared,  the  “model  features”,  and  the  “image  features". 

The object model features were derived from a synthetic range image of an M35 truck that was created using the ray tracing program associated with the BRL CAD Package [23]. The ray tracer was modified to produce range images instead of shaded images. The synthetic range image appears in the upper left of Figure 7-2.

In order to simulate a laser radar, the synthetic range image described above was corrupted with simulated laser radar sensor noise, using a sensor noise model that is described by Shapiro, Reinhold, and Park [62]. In this noise model, measured ranges are either valid or anomalous. Valid measurements are normally distributed, and anomalous measurements are uniformly distributed. The corrupted range image appears in Figure 7-2 on the right. To simulate post sensor processing, the corrupted image was "restored" via a statistical restoration method of Menon and Wells [56]. The restored range image appears in the lower position of Figure 7-2.

Figure 7-1: Preparation of Features

Figure 7-2: Synthetic Range Image, Noisy Range Image, and Restored Range Image

Oriented range features, as described in Section 5.4, were extracted from the synthetic range image for use as model features, and from the restored range image; the latter are called the noisy features. The features were extracted from the range images in the following manner. Range discontinuities were located by thresholding neighboring pixels, yielding range discontinuity curves. These curves were then segmented into approximately 20-pixel-long segments via a process of line segment approximation. The line segments (each representing a fragment of a range discontinuity curve) were then converted into oriented range features in the following manner. The X and Y coordinates of the feature were obtained from the mean of the pixel coordinates. The normal vector to the pixels was obtained via least-squares line fitting. The range to the feature was estimated by taking the mean of the pixel ranges on the near side of the discontinuity. This information was packaged into an oriented-range feature, as described in Section 5.4. The model features are shown in the first image of Figure 7-3. Each line segment represents one oriented-range feature; the ticks on the segments indicate the near side of the range discontinuity. There are 113 such object features.

The noisy features, derived from the restored range image, appear in the second image of Figure 7-3. There are 62 noisy features. Some features have been lost due to the corruption and restoration of the range image. The set of image features was prepared from the noisy features by randomly deleting half of the features, transforming the survivors according to a test pose, and adding sufficient randomly generated features so that 1/8 of the features are due to the object (31 of 248). The 248 image features appear in the third image of Figure 7-3.
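
The preparation steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure actually used: features are reduced to 2D points, the test pose is a translation, and clutter is drawn uniformly over a square image. The function name and its parameters are hypothetical.

```python
import random


def prepare_image_features(noisy, pose, object_fraction, extent, rng):
    """Build a cluttered image feature set from the noisy object features.

    Hypothetical simplification: features are 2D points and the pose is a
    translation; the experiment itself uses oriented-range features.
    """
    # Randomly delete half of the features.
    survivors = rng.sample(noisy, len(noisy) // 2)
    # Transform the survivors according to the test pose.
    moved = [(x + pose[0], y + pose[1]) for (x, y) in survivors]
    # Add uniformly random clutter until the requested object fraction is reached.
    total = round(len(moved) / object_fraction)
    clutter = [(rng.uniform(0.0, extent), rng.uniform(0.0, extent))
               for _ in range(total - len(moved))]
    return moved + clutter


rng = random.Random(0)
noisy = [(rng.uniform(0.0, 256.0), rng.uniform(0.0, 256.0)) for _ in range(62)]
image = prepare_image_features(noisy, (5.0, 3.0), 1.0 / 8.0, 256.0, rng)
```

Starting from 62 noisy features, 31 survive deletion, so 217 clutter features are needed to reach 248 image features with one eighth due to the object.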

7.3.2  Sampling  The  Objective  Function 

The objective function of Equation 7.5 was sampled along four straight lines passing through the (known) location in pose space of the test pose. Oriented stationary statistics were used, as described in Section 3.0. The stationary feature covariance was estimated from a hand match done with a mouse between the model features and the noisy features. The background rate parameter B was set to

Samples taken along a line through the location of the true pose in pose space, parallel to the X axis, are shown in Figure 7-4. This corresponds to moving the object along the X axis. The first graph shows samples taken along a 100 pixel length (the image is 256 pixels square). The second graph of Figure 7-4 shows samples taken along a 10 pixel length, and the third graph shows samples taken along a 1 pixel length. The X coordinate of the test pose is 55.5; the third graph shows the peak of the objective function to be in error by about one twentieth of a pixel.

Samples taken along a line parallel to the μ axis of pose space are shown in Figure 7-5. This corresponds to a simultaneous change in scale and angular orientation of the object.

Each of the above graphs represents 50 equally spaced samples. The samples are joined with straight line segments for clarity. Sampling was also done parallel to the Y and ν axes with similar results.

The sampling described in this section shows that in the experimental domain the objective function has a prominent, sharp peak near the correct location. Some local maxima are also apparent. The observed peak may not be the dominant peak, since no global searching was performed.

Coarse-Fine  Sampling 

Additional sampling of the objective function of Equation 7.5 was performed to investigate the feasibility of coarse-fine search techniques. A coarse-fine search method for finding maxima of the pose-space objective function would proceed as follows. Peaks are initially located at a coarse scale. At each stage, the peak from the previous scale is used as the starting value for a search at the next (less smooth) scale.

Figure 7-4: Objective Function Samples Along X-Oriented Line Through Test Pose. Lengths: 100 Pixels, 10 Pixels, 1 Pixel

Figure 7-5: Objective Function Samples Along μ-Oriented Line Through Test Pose. Lengths: .8, .1, and .01

The objective function was smoothed by replacing the stationary feature covariance matrix $\hat\psi$ in the following manner:

$$\hat\psi \leftarrow \hat\psi + \psi_s .$$

The effect of the smoothing matrix $\psi_s$ is to increase the spatial scale of the covariance matrices that appear in the objective function.

Probes along the X axis through the known location of the test pose, with various amounts of smoothing, are shown in Figure 7-6. The smoothing matrices used in the probing were as follows, in the same order as the figures.

$$\mathrm{DIAG}((.1)^2, (.1)^2, (10.0)^2, (10.0)^2) ,$$

$$\mathrm{DIAG}((.025)^2, (.025)^2, (2.5)^2, (2.5)^2) ,$$

and

$$\mathrm{DIAG}((.01)^2, (.01)^2, 1.0, 1.0) ,$$

where DIAG(·) constructs diagonal matrices from its arguments. These smoothing matrices were determined empirically. (No smoothing was performed in the fourth figure.)
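
A coarse-fine search of this kind can be sketched in one dimension. The sketch below is an illustration under several assumptions that are not taken from the experiment: the pose is a scalar translation, features are scalars, smoothing is applied by adding a smoothing variance to the feature variance, the local search at each scale is a dense sampling of a window around the previous peak, and the data, the `gain` constant, and the smoothing schedule are all hypothetical.

```python
import math


def smoothed_objective(beta, image, model, sigma, smooth, gain):
    """1D PMPE-style alignment term with the feature variance enlarged by smoothing."""
    var = sigma ** 2 + smooth ** 2  # feature variance plus smoothing variance
    norm = 1.0 / math.sqrt(2.0 * math.pi * var)
    total = 0.0
    for y in image:
        s = sum(norm * math.exp(-0.5 * (y - (mj + beta)) ** 2 / var) for mj in model)
        total += math.log1p(gain * s)
    return total


def windowed_peak(f, center, half_width, samples=201):
    """Return the best sampled argument of f within center +/- half_width."""
    best_b, best_v = center, f(center)
    for k in range(samples):
        b = center - half_width + 2.0 * half_width * k / (samples - 1)
        v = f(b)
        if v > best_v:
            best_b, best_v = b, v
    return best_b


def coarse_fine(image, model, sigma, gain, start, schedule):
    """Track the peak from the smoothest scale down to the unsmoothed objective."""
    beta = start
    for smooth in schedule:
        half_width = 3.0 * math.sqrt(sigma ** 2 + smooth ** 2)
        beta = windowed_peak(
            lambda b, s=smooth: smoothed_objective(b, image, model, sigma, s, gain),
            beta, half_width)
    return beta


def demo():
    model = [0.0, 7.0, 19.0, 23.0, 40.0]
    true_beta = 12.3
    image = [mj + true_beta for mj in model] + [3.0, 60.0]  # two clutter features
    return coarse_fine(image, model, sigma=0.5, gain=1.0, start=0.0,
                       schedule=[10.0, 2.5, 1.0, 0.0])
```

The search window shrinks with the scale, so the expensive fine-scale objective is only sampled close to the peak handed down from the coarser scale. As with the probes reported here, nothing in the sketch guarantees that the tracked peak is the global maximum.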

These smoothed sampling experiments indicate that coarse-fine search may be feasible in this domain. In Figure 7-6 it is apparent that the peak at one scale may be used as a starting value for local search in the next scale. This indicates that a final line search along the X axis could use the coarse-fine strategy. It is not sufficient evidence that such a strategy will work in general. As before, there is no guarantee that the located maximum is the global maximum.

Figure 7-6: X Probes in Smoothed Objective Function


7.4  Video  Image  Experiment 

In this section, another experiment with the PMPE objective function is described. The features are point-radius features derived from video images. A local search in pose space is carried out, and the objective function, and a smoothed variant, are probed in the vicinity of the peak.

7.4.1  Preparation  of  Features 

The features used in this experiment are the same as those used in the MAP Model Matching correspondence search experiment reported in Section 6.2. They are point-radius features, as described in Section 5.3. The features appear in Figure 6-4.

7.4.2  Search  in  Pose  Space 

A search was carried out in pose space from a starting value that was determined by hand. The search was implemented with Powell's method [59] of multidimensional non-linear optimization. Powell's method is similar to the conjugate-gradient method, but derivatives are not used. The line minimizations were carried out with Brent's method [59], which uses successive parabolic approximations. The pose resulting from the search is illustrated in Figure 7-7. This result is close to the best result from the MAP Model Matching correspondence search experiment. That result is reproduced here in Figure 7-8. It is comforting that these two substantially different search methods (combinatorial versus continuous) provide similar answers in at least one experiment.

7.4.3  Sampling  The  Objective  Function 

Samples were taken along four straight lines passing through the peak in the objective function resulting from the search in pose space reported above. (In the range experiment, sampling was done through the known true pose.) The results are illustrated in Figure 7-9. The peak in this data is not as sharp as the peak in the range experiment reported in the previous section. This is likely due to the fact that the features used in the video experiment are substantially less constraining than those used in the range experiment, which have good range information in them.

Figure 7-9: Probes of Objective Function Peak

Sampling of the objective function with smoothing was also performed, as in Section 7.3.2. Smoothing was performed at one scale. The smoothing matrix was

Probing, performed in the same manner as in Figure 7-9, was performed on the smoothed objective function. The results are shown in Figure 7-10. In comparison to the range image experiment, local maxima are more of an issue here. This may be partly due to the background features here having more structure than the randomly generated background features used in the range image experiment. Because of this, anomalous pose estimates (where the pose corresponding to the global maximum of the objective function is seriously in error) may be more likely in this domain than in the range experiment.

7.5  Relation  to  Robust  Estimation 

This  section  describes  a  relationship  between  PMPE  and  robust  estimation.  By 
simplifying  the  domain  a  robust  estimator  of  position  is  obtained.  A  connection 
between  the  simplified  robust  estimator  and  neural  networks  is  discussed. 

Consider  the  following  simplifications  of  the  domain: 

•  drop  the  pose  prior 

•  the  object  has  one  feature 

•  the  image  is  one-dimensional  with  width  $W$

•  the  pose  is  a  scalar 

•  the  projection  function  translates:  $P(\cdot, \beta) = \beta$

With these simplifications, the observation model of Equation 6.1 becomes

\[
p(Y_i \mid \Gamma_i, \beta) =
\begin{cases}
\dfrac{1}{W} & \text{if } \Gamma_i = \perp \\[4pt]
G_\sigma(Y_i - \beta) & \text{otherwise.}
\end{cases}
\]

Here $G_\sigma$ is the normal density with variance $\sigma^2$.

In this simplified domain $\Gamma$ may be interpreted as a collection of variables that
describe the validity of their corresponding measurements in $Y$. Thus, $\Gamma_i \neq \perp$ may be
interpreted as meaning that $Y_i$ is valid, and $\Gamma_i = \perp$ as $Y_i$ being invalid. $p(Y_i)$ is defined
to be zero outside of the range $[-\frac{W}{2}, \frac{W}{2}]$.

The prior on correspondences of Equation 2.2 takes the following form

\[
p(\Gamma_i) =
\begin{cases}
B & \text{if } \Gamma_i = \perp \\
1 - B & \text{otherwise.}
\end{cases}
\]

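Using Bayes' rule and the independence of $\Gamma_i$ and $\beta$ allows the following probability
of a sample and its validity,

\[
p(Y_i, \Gamma_i \mid \beta) = p(Y_i \mid \Gamma_i, \beta)\, p(\Gamma_i) =
\begin{cases}
\dfrac{B}{W} & \text{if } \Gamma_i = \perp \\[4pt]
(1 - B)\, G_\sigma(Y_i - \beta) & \text{otherwise.}
\end{cases}
\tag{7.10}
\]

The probability of a sample may now be expressed by taking a marginal over the
probability in Equation 7.10, as follows.

\[
p(Y_i \mid \beta) = \sum_{\Gamma_i} p(Y_i, \Gamma_i \mid \beta) = \frac{B}{W} + (1 - B)\, G_\sigma(Y_i - \beta) .
\]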

Defining an objective function as a log likelihood of $\beta$
leads to the analog of the PMPE objective function for this simplified domain.

\[
L(\beta) = \sum_i \ln\left[\frac{B}{W} + (1 - B)\, G_\sigma(Y_i - \beta)\right] \tag{7.11}
\]

This may also be written

\[
L(\beta) = \sum_i S(Y_i - \beta) \tag{7.12}
\]

where

\[
S(x) = \ln\left[\frac{B}{W} + (1 - B)\, G_\sigma(x)\right] .
\]

This is the Maximum Likelihood objective function for estimating the mean of a
normal population of variance $\sigma^2$, that is contaminated with a uniform population of
width $W$, where the fraction of the mixture due to the uniform population is $B$.

The function $S(x)$ is approximately quadratic when the residual is small, and
approaches a constant when the residual is large. When $B$ goes to zero, $S(x)$ becomes
quadratic, and the estimator becomes least squares, for the case of a pure normal
population. When $-S(x)$ is viewed as a penalty function, it is seen to provide a
quadratic penalty for small residuals, as least squares does, but the penalty saturates
when residuals become large. Robust estimation is concerned with estimators that
are, like this one, less sensitive to outliers than least squares. As with many robust
estimators, the resulting optimization problem is more difficult than least squares,
since the objective function is non-convex. This estimator falls into the class of
redescending M-estimators as discussed by Huber [41].
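The saturating behaviour can be seen on a small example. The sketch below evaluates the objective of Equation 7.12 on a grid and compares its maximizer with the sample mean. The data values, the grid search, and the parameter settings (`B = 0.5`, `W = 100`, `sigma = 1`) are assumptions chosen for illustration; they are not taken from the experiments.

```python
import math


def gaussian(x, sigma):
    """Normal density with standard deviation sigma, evaluated at x."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def s(x, b, w, sigma):
    """S(x) = ln[B/W + (1 - B) G_sigma(x)], the per-sample term of Equation 7.12."""
    return math.log(b / w + (1.0 - b) * gaussian(x, sigma))


def objective(beta, ys, b, w, sigma):
    """L(beta) = sum_i S(Y_i - beta)."""
    return sum(s(y - beta, b, w, sigma) for y in ys)


def robust_location(ys, b, w, sigma, steps=2001):
    """Grid search for the maximizer of L(beta) over [-W/2, W/2]."""
    grid = [-w / 2.0 + w * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda beta: objective(beta, ys, b, w, sigma))


# Five samples clustered around 10, plus five widely scattered "background" samples.
inliers = [9.0, 9.5, 10.0, 10.5, 11.0]
outliers = [-40.0, -30.0, -20.0, 35.0, 45.0]
ys = inliers + outliers

beta_hat = robust_location(ys, b=0.5, w=100.0, sigma=1.0)
mean = sum(ys) / len(ys)
print("robust estimate:", beta_hat, "least-squares estimate (mean):", mean)
```

The scattered samples each contribute a nearly constant term to the objective, so the maximizer stays with the cluster, while the mean is pulled well away from it. A grid search is used here only because the objective is non-convex and the example is one-dimensional.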

PMPE is somewhat different from robust estimation in that the saturating aspect
of the objective function not only decreases the influence of "outliers" (by analogy,
the background features), it also reduces the influence of image features that don't
correspond to (are not close to) a given object feature.
7.5.1  Connection  to  Neural  Network  Sigmoid  Function

There is an important connection between the estimator of Equation 7.12 and the
sigmoid function of neural networks,

\[
\sigma(x) = \frac{1}{1 + \exp(-x)} .
\]

The sigmoid function is a smooth variant of a logical switching function that has
been used for modeling neurons. It has been used extensively by the neural network
community in the construction of networks that classify and exhibit some forms of
learning behavior. The NETtalk neural network of Sejnowski and Rosenberg [61] is
a well known example.

It turns out that, under some conditions on the parameters, the sigmoid function
of $x^2$ is approximately equal to $S(x)$, ignoring shifting and scaling. This near equality
is illustrated in Figure 7-11.

The two functions that are plotted in the figure are

\[
f(x) = 2.0\,[\sigma(x^2) - .5]
\]

and

\[
g(x) = \frac{\ln[.25 + .75 \exp(-x^2)]}{\ln(.25)} .
\]

The upper graph shows $f(x)$ and $g(x)$ plotted together, while the lower graph shows
their difference. It can be seen that they agree to better than one percent.
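The near equality is easy to check numerically. The following sketch evaluates both functions on a grid and records the largest absolute difference; the sampling range of [-4, 4] and the step of 0.01 are arbitrary choices made for this illustration.

```python
import math


def sigmoid(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def f(x):
    """Shifted and scaled sigmoid of x^2, running from 0 at x = 0 towards 1."""
    return 2.0 * (sigmoid(x * x) - 0.5)


def g(x):
    """Normalized log of a constant-plus-Gaussian mixture, also running from 0 towards 1."""
    return math.log(0.25 + 0.75 * math.exp(-x * x)) / math.log(0.25)


xs = [k * 0.01 for k in range(-400, 401)]
max_abs_diff = max(abs(f(x) - g(x)) for x in xs)
print("largest |f - g| on the grid:", max_abs_diff)
```

Both functions are normalized to rise from 0 to 1, so an absolute difference below 0.01 corresponds to agreement within one percent of that range.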

Because of this near equality, for a special case of the parameters, a network that
evaluates the ML estimator of Equation 7.12 for a contaminated normal population
will have the form illustrated in Figure 7-12.

This network, with its arrangement of sigmoid and sum units, seems to fit the
definition of a neural network.

The robust estimator of Equation 7.12, and its neural network approximation, are
(approximately) optimal for locating a Gaussian cluster in uniform noise.

A similar neural network realization of the PMPE objective function would likewise
be near optimal for locating an object against a uniform background.
Figure 7-12: Network Implementation of MAP Estimator for Contaminated Normal

7.6  PMPE  Efficiency  Bound


This section provides a lower bound on the covariance matrix of the PMPE estimator.
Estimators of vector parameters (like pose) may be characterized by the covariance
matrix of the estimates they produce. The Cramer-Rao bound provides a lower
limit for the covariance matrix of unbiased estimators. Unbiased estimators that
achieve this bound are called efficient estimators. Discussions of estimator efficiency
and Cramer-Rao bounds appear in [63] and [72].

The Cramer-Rao bound on the covariance matrix of estimators of $\beta$ based on
observations of $X$ is given by the inverse of the Fisher information matrix,

\[
\mathrm{COV}(\hat{\beta}) \geq I_F^{-1}(\beta) .
\]

Here, $\mathrm{COV}(\cdot)$ denotes the covariance matrix of the random vector argument. This
matrix inequality means that $\mathrm{COV}(\hat{\beta}) - I_F^{-1}(\beta)$ is positive semi-definite.

The Fisher information matrix is defined as follows.

\[
I_F(\beta) \equiv E_X\left( [\nabla_\beta \ln p(X \mid \beta)]\, [\nabla_\beta \ln p(X \mid \beta)]^T \right)
\]

where $\nabla_\beta$ is the gradient with respect to $\beta$, which yields a column-vector, and $E_X(\cdot)$
stands for the expected value of the argument with respect to $p(X)$.

The covariance matrix, and the Cramer-Rao bound, of the PMPE estimator are
difficult to calculate. Instead, the Cramer-Rao bound and efficiency will be determined
for estimators that have access to both the observed features $Y_i$ and the
correspondences $\Gamma_i$. The Cramer-Rao bound for these "complete-data" estimators will be
found, and it will be shown that there are no efficient complete-data estimators.
Because of this, the PMPE estimator is subject to the same bound as the complete-data
estimators, and the PMPE estimator cannot be efficient. This follows because the
PMPE estimator can be considered to be technically a complete-data estimator that
ignores the correspondence data.

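In terms of the complete-data estimator, the Fisher information has the following
form.

\[
I_F(\beta) = E_{Y,\Gamma}\left( [\nabla_\beta \ln p(Y, \Gamma \mid \beta)]\, [\nabla_\beta \ln p(Y, \Gamma \mid \beta)]^T \right) . \tag{7.13}
\]

Assuming independence of feature coordinates and of correspondences, the probability
of the complete-data is

\[
p(Y, \Gamma \mid \beta) = \prod_i p(Y_i, \Gamma_i \mid \beta) .
\]

Using Bayes' rule and the independence of $\Gamma_i$ and $\beta$,

\[
p(Y_i, \Gamma_i \mid \beta) = p(Y_i \mid \Gamma_i, \beta)\, p(\Gamma_i) . \tag{7.14}
\]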


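Referring to Equations 6.1 and 6.3, and using constant background probability $B$,
and linear projection, the complete-data component probability may be written as
follows.

\[
p(Y_i, \Gamma_i \mid \beta) =
\begin{cases}
\dfrac{B}{W_1 W_2 \cdots W_v} & \text{if } \Gamma_i = \perp \\[6pt]
\dfrac{1 - B}{m}\, G_{\psi_{ij}}(Y_i - M_j \beta) & \text{if } \Gamma_i = M_j .
\end{cases}
\]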


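Working towards an expression for the Fisher information, we differentiate the
complete-data probability to obtain

\[
\nabla_\beta \ln p(Y, \Gamma \mid \beta) = \nabla_\beta \ln \prod_i p(Y_i, \Gamma_i \mid \beta) = \sum_i \frac{\nabla_\beta\, p(Y_i, \Gamma_i \mid \beta)}{p(Y_i, \Gamma_i \mid \beta)} .
\]

When $\Gamma_i = \perp$, $\nabla_\beta\, p(Y_i, \Gamma_i \mid \beta) = 0$; otherwise, in the case $\Gamma_i = M_j$,

\[
\nabla_\beta\, p(Y_i, \Gamma_i \mid \beta) = \nabla_\beta\, \frac{1 - B}{m}\, G_{\psi_{ij}}(Y_i - M_j \beta) .
\]

Differentiating the normal density (a formula for this appears elsewhere in this work) gives

\[
\nabla_\beta\, p(Y_i, \Gamma_i \mid \beta) = \frac{1 - B}{m}\, G_{\psi_{ij}}(Y_i - M_j \beta)\, M_j^T \psi_{ij}^{-1} (Y_i - M_j \beta)
\]

so that

\[
\frac{\nabla_\beta\, p(Y_i, \Gamma_i \mid \beta)}{p(Y_i, \Gamma_i \mid \beta)} = M_j^T \psi_{ij}^{-1} (Y_i - M_j \beta) \quad \text{when } \Gamma_i = M_j .
\]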


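Then the gradient of the complete-data probability may be expressed as

\[
\nabla_\beta \ln p(Y, \Gamma \mid \beta) = \sum_{ij:\, \Gamma_i = M_j} M_j^T \psi_{ij}^{-1} (Y_i - M_j \beta) .
\]

Note that setting this expression to zero defines the Maximum Likelihood estimator
for $\beta$ in the complete-data case, as follows:

\[
\sum_{ij:\, \Gamma_i = M_j} M_j^T \psi_{ij}^{-1} M_j\, \hat{\beta} = \sum_{ij:\, \Gamma_i = M_j} M_j^T \psi_{ij}^{-1} Y_i
\]

or

\[
\hat{\beta} = \left( \sum_{ij:\, \Gamma_i = M_j} M_j^T \psi_{ij}^{-1} M_j \right)^{-1} \sum_{ij:\, \Gamma_i = M_j} M_j^T \psi_{ij}^{-1} Y_i . \tag{7.15}
\]

This estimator is linear in $Y$. The inverse has been assumed to exist. It will exist
provided certain linear independence conditions are met, and enough correspondences
to model features appear in the match. This typically requires two to four
correspondences in the applications described here.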

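Returning to the Fisher information, we need to evaluate the expectation:

\[
I_F = E_{Y,\Gamma}\left[ \left( \sum_{ij:\, \Gamma_i = M_j} M_j^T \psi_{ij}^{-1} e_{ij} \right) \left( \sum_{ij:\, \Gamma_i = M_j} M_j^T \psi_{ij}^{-1} e_{ij} \right)^T \right]
\]

where the $ij$'th residual has been written as follows:

\[
e_{ij} = Y_i - M_j \beta .
\]

Re-naming indices and pulling out the sums gives

\[
I_F = E_{Y,\Gamma}\left[ \sum_{ij:\, \Gamma_i = M_j}\ \sum_{i'j':\, \Gamma_{i'} = M_{j'}} M_j^T \psi_{ij}^{-1}\, e_{ij}\, e_{i'j'}^T\, \psi_{i'j'}^{-1} M_{j'} \right] .
\]

Referring to Equation 7.14, the expectation may be split and moved as follows.

\[
I_F = E_\Gamma\left[ \sum_{ij:\, \Gamma_i = M_j}\ \sum_{i'j':\, \Gamma_{i'} = M_{j'}} M_j^T \psi_{ij}^{-1}\, E_{Y \mid \Gamma}\!\left( e_{ij}\, e_{i'j'}^T \right) \psi_{i'j'}^{-1} M_{j'} \right] .
\]

The inner expectation is over mutually independent Gaussian random vectors, and
equals their covariance matrix when the indices agree, and is zero otherwise, so

\[
I_F = E_\Gamma\left[ \sum_{ij:\, \Gamma_i = M_j} M_j^T \psi_{ij}^{-1}\, \psi_{ij}\, \psi_{ij}^{-1} M_j \right] .
\]

This expression simplifies to the following:

\[
I_F = E_\Gamma\left[ \sum_{ij:\, \Gamma_i = M_j} M_j^T \psi_{ij}^{-1} M_j \right] .
\]

The sum may be re-written in the following way by using a delta function comparing
$\Gamma_i$ and $M_j$,

\[
I_F = E_\Gamma\left[ \sum_{ij} M_j^T \psi_{ij}^{-1} M_j\, \delta_{\Gamma_i M_j} \right] = \sum_{ij} E_{\Gamma_i}\!\left[ \delta_{\Gamma_i M_j} \right] M_j^T \psi_{ij}^{-1} M_j .
\]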

The expectation is just the probability that an image feature is matched to a particular
model feature. This is $\frac{1 - B}{m}$, so the Fisher information may be written in the following
simple form,

\[
I_F = \frac{1 - B}{m} \sum_{ij} M_j^T \psi_{ij}^{-1} M_j
\]

or as

\[
I_F = (1 - B)\, n\ \frac{1}{mn} \sum_{ij} M_j^T \psi_{ij}^{-1} M_j .
\]

This is an attractive result, and may be easily interpreted in relation to the Fisher
information for the pose when the correspondences are fixed (a standard linear
estimator). The Fisher information in that case is $\sum_{ij:\, \Gamma_i = M_j} M_j^T \psi_{ij}^{-1} M_j$; it may be interpreted
as the sum over matches of the per-match Fisher information.

In light of this, the complete-data Fisher information is seen to be the average
of the per-match Fisher information, multiplied by the expected number of features
matched to the model, $(1 - B)n$.

An efficient unbiased estimator for the complete-data exists if and only if

\[
\hat{\beta} = \beta + I_F^{-1}(\beta)\, \nabla_\beta \ln p(Y, \Gamma \mid \beta) .
\]

This requires that the right hand side be independent of $\beta$, since the estimator $\hat{\beta}$
(Equation 7.15) is not a function of $\beta$. Expanding the right hand side,

\[
\beta + \left[ \frac{1 - B}{m} \sum_{ij} M_j^T \psi_{ij}^{-1} M_j \right]^{-1} \sum_{ij:\, \Gamma_i = M_j} M_j^T \psi_{ij}^{-1} (Y_i - M_j \beta) .
\]

This is not independent of $\beta$. One way to see this is to note that the factor multiplying
$\beta$ in the second term is a function of $\Gamma$. Thus, no efficient estimator exists in the
complete-data case, and consequently, no efficient estimator exists for PMPE.


7.7  Related  Work 


Green [31] and Green and Shapiro [32] describe a theory of Maximum Likelihood
laser radar range profiling. The research focuses on statistically optimal detectors
and recognizers. The single pixel statistics are described by a mixture of uniform
and normal components. Range profiling is implemented using the EM algorithm.
Under some circumstances, least squares provides an adequate starting value. A
continuation-style variant is described, where a range accuracy parameter is varied
between EM convergences from a coarse value to its true value. Green [31] computes
Cramer-Rao bounds for the complete-data case of the Maximum Likelihood range profile
estimator, and compares simulated and real-data performance to the limits.

Cass [16] [15] describes an approach to visual object recognition that searches
in pose space for maximal alignments under the bounded-error model. The
pose-space objective function used there is piecewise constant, and is thus not amenable
to continuous search methods. The search is based on a geometric formulation of the
constraints on feasible transformations.

There are some connections between PMPE and standard methods of robust pose
estimation, like those described by Haralick [38], and Kumar and Hanson [48]. Both
can provide robust estimates of the pose of an object, based on an observed image.
The main difference is that the standard methods require specification of the feature
correspondences, while PMPE does not: it considers all possible correspondences.
PMPE requires a starting value for the pose (as do standard methods of robust pose
estimation that use non-convex objective functions).

As mentioned above, Yuille, Geiger and Bülthoff [78] discussed computing
disparities in a statistical theory of stereo where a marginal is computed over matches.
Yuille extends this technique [79] to other domains of vision and neural networks,
among them winner-take-all networks, stereo, long-range motion, the traveling
salesman problem, deformable template matching, learning, content addressable
memories, and models of brain development. In addition to computing marginals over
discrete fields, the Gibbs probability distribution is used. This facilitates
continuation-style optimization methods by variation of the temperature parameter. There are
some similarities between this approach and using coarse-fine with the PMPE
objective function.

Edelman and Poggio [24] describe a method of 3D recognition that uses a trained
Generalized Radial Basis Function network. Their method requires correspondences
to be known during training and recognition. One similarity between their scheme
and PMPE is that both are essentially arrangements of smooth unimodal functions.

There is a similarity between Posterior Marginal Pose Estimation and Hough
transform (HT) methods. Roughly speaking, HT methods evaluate parameters by
accumulating votes in a discrete parameter space, based on observed features. (See
the survey paper by Illingworth and Kittler [44] for a discussion of Hough methods.)

In  a  recognition  application,  as  described  here,  the  HT  method  would  evaluate  a 
discrete  pose  by  counting  the  number  of  feature  pairings  that  are  exactly  consistent 
somewhere  within  the  cell  of  pose  space.  As  stated,  the  HT  method  has  difficulties 
with  noisy  features.  This  is  usually  addressed  by  counting  feature  pairings  that  are 
exactly  consistent  somewhere  nearby  the  cell  in  pose  space. 

The utility of the HT as a stand-alone method for recognition in the presence of
noise is a topic of some controversy. This is discussed by Grimson in [44], pp. 2‘JU.
Perhaps this is due to an unsuitable noise model implicit in the Hough Transform.

Posterior  Marginal  Pose  Estimation  evaluates  a  pose  by  accumulating  the  loga¬ 
rithm  of  posterior  marginal  probability  of  the  pose  over  image  features. 

The connection between HT methods and parameter evaluation via the logarithm
of posterior probability has been described by Stephens [67]. Stephens proposes to call
the posterior probability of parameters given image observations "The Probabilistic
Hough Transform". He provided an example of estimating line parameters from
image point features whose probability densities were described as having uniform
and normal components. He also states that the method has been used to track 3D
objects, referring to his thesis [68] for definition of the method used.
7.8  Summary

A method of evaluating poses in object recognition, Posterior Marginal Pose
Estimation, has been described. The resulting objective function was seen to have a simple
form when normal feature deviation models and linear projection models are used.

Limited experimental results were shown indicating that in a domain of synthetic
range discontinuity features, the objective function may have a prominent sharp peak
near the correct pose. Some local maxima were also apparent. Another experiment,
in which the features were derived from video images, was described. Connections to
robust estimation and neural networks were examined. Bounds on the performance of
simplified PMPE estimators were indicated, and relation to other work was discussed.


Chapter  8 

Expectation  —  Maximization 
Algorithm 

The Expectation - Maximization (EM) algorithm was introduced in its general form
by Dempster, Laird and Rubin in 1977 [21]. It is often useful for computing estimates
in domains having two sample spaces, where the events in one are unions over events
in the other. This situation holds among the sample spaces of Posterior Marginal
Pose Estimation (PMPE) and MAP Model Matching. In the original paper, the wide
generality of the EM algorithm is discussed, along with several previous appearances
in special cases, and convergence results are described.

In this chapter, a specific form of the EM algorithm is described for use with
PMPE. It is used for hypothesis refinement in the recognition experiments that are
described in Chapter 10. Issues of convergence and implementation are discussed.

8.1  Definition  of  EM  Iteration 

In this section a variant of the EM algorithm is presented for use with Posterior
Marginal Pose Estimation, which was described in Chapter 7. The following modeling
assumptions were used. Normal models are used for matched image features, while
uniform models are used for unmatched (background) features. If a prior on the pose is
available, it is normal. The independent correspondence model is used. Additionally,
a linear model is used for feature projection.

In PMPE, the pose of an object, $\beta$, is estimated by maximizing its posterior
probability, given an image.

\[
\hat{\beta} = \arg\max_{\beta}\ p(\beta \mid Y) .
\]

A necessary condition for the maximum is that the gradient of the posterior
probability with respect to the pose be zero, or equivalently, that the gradient of the
logarithm of the posterior probability be zero:

\[
0 = \nabla_\beta \ln p(\hat{\beta} \mid Y) . \tag{8.1}
\]

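In Section 7.1, Equation 7.2, the following formula was given for the posterior
probability of the pose of an object, given an image. This assumes use of the independent
correspondence model.

\[
p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \prod_i p(Y_i \mid \beta) .
\]

Imposing the condition of Equation 8.1 yields the following.

\[
0 = \nabla_\beta \left[ \ln \frac{1}{p(Y)} + \ln p(\hat{\beta}) + \sum_i \ln p(Y_i \mid \hat{\beta}) \right]
\]

or

\[
0 = \frac{\nabla_\beta\, p(\hat{\beta})}{p(\hat{\beta})} + \sum_i \frac{\nabla_\beta\, p(Y_i \mid \hat{\beta})}{p(Y_i \mid \hat{\beta})} . \tag{8.2}
\]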

As in Equation 7.8, we may write the feature PDF conditioned on pose in the
following way,

\[
p(Y_i \mid \beta) = \sum_{\Gamma_i} p(Y_i \mid \Gamma_i, \beta)\, p(\Gamma_i)
\]

or, using the specific models assumed in Section 7.1, as reflected in Equation 7.4, and
using a linear projection model,

\[
p(Y_i \mid \beta) = \frac{B_i}{W_1 W_2 \cdots W_v} + \frac{1 - B_i}{m} \sum_j G_{\psi_{ij}}(Y_i - M_j \beta) .
\]


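The zero gradient condition of Equation 8.2 may now be expressed as follows.

\[
0 = \frac{\nabla_\beta\, p(\hat{\beta})}{p(\hat{\beta})} + \sum_i \frac{\frac{1 - B_i}{m} \sum_j \nabla_\beta\, G_{\psi_{ij}}(Y_i - M_j \hat{\beta})}{\frac{B_i}{W_1 W_2 \cdots W_v} + \frac{1 - B_i}{m} \sum_k G_{\psi_{ik}}(Y_i - M_k \hat{\beta})} .
\]

With a normal pose prior,

\[
p(\beta) = G_{\psi_\beta}(\beta - \beta_0) \quad \text{and} \quad \nabla_\beta\, p(\beta) = -p(\beta)\, \psi_\beta^{-1} (\beta - \beta_0) .
\]

The gradient of the other normal density is

\[
\nabla_\beta\, G_{\psi_{ij}}(Y_i - M_j \beta) = -G_{\psi_{ij}}(Y_i - M_j \beta)\, M_j^T \psi_{ij}^{-1} (M_j \beta - Y_i) . \tag{8.3}
\]

Returning to the gradient condition, and using these expressions (negated),

\[
0 = \psi_\beta^{-1} (\hat{\beta} - \beta_0) + \sum_i \frac{\frac{1 - B_i}{m} \sum_j G_{\psi_{ij}}(Y_i - M_j \hat{\beta})\, M_j^T \psi_{ij}^{-1} (M_j \hat{\beta} - Y_i)}{\frac{B_i}{W_1 W_2 \cdots W_v} + \frac{1 - B_i}{m} \sum_k G_{\psi_{ik}}(Y_i - M_k \hat{\beta})} .
\]

Finally, the zero gradient condition may be expressed compactly as follows,

\[
0 = \psi_\beta^{-1} (\hat{\beta} - \beta_0) + \sum_{ij} W_{ij}\, M_j^T \psi_{ij}^{-1} (M_j \hat{\beta} - Y_i) \tag{8.4}
\]

with the following definition:

\[
W_{ij} = \frac{G_{\psi_{ij}}(Y_i - M_j \hat{\beta})}{\frac{B_i}{1 - B_i} \frac{m}{W_1 W_2 \cdots W_v} + \sum_k G_{\psi_{ik}}(Y_i - M_k \hat{\beta})} . \tag{8.5}
\]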


Equation 8.4 has the appearance of being a linear equation for the pose estimate $\hat{\beta}$
that satisfies the zero gradient condition for being a maximum. Unfortunately, it isn't
a linear equation, because the $W_{ij}$ (the "weights") are not constants; they are functions
of $\hat{\beta}$. To find solutions to Equation 8.4, the EM algorithm iterates the following two
steps:

•  Treating the weights, $W_{ij}$, as constants, solve Equation 8.4 as a linear equation
for a new pose estimate $\hat{\beta}$. This is referred to as the M step.

•  Using the most recent pose estimate $\hat{\beta}$, re-evaluate the weights, $W_{ij}$, according
to Equation 8.5. This is referred to as the E step.

The M step is so named because, in the exposition of the algorithm in [21], it
corresponds to a Maximum Likelihood estimate. As discussed there, the algorithm
is also amenable to use in MAP formulations (like this one). Here the M step
corresponds to a MAP estimate of the pose, given that the current estimate of the weights
is correct.

The E step is so named because calculating the $W_{ij}$ corresponds to taking the
expectation of some random variables, given the image data, and given that the most recent
pose estimate is correct. These random variables have value 1 if the i'th image feature
corresponds to the j'th object feature, and 0 otherwise. Thus, after the iteration
converges, the weights provide continuous-valued estimates of the correspondences,
that vary between 0 and 1.

It seems somewhat ironic that, having abandoned the correspondences as being
part of the hypothesis in the formulation of PMPE, a good estimate of them has
re-appeared as a byproduct of a method for search in pose space. This estimate, the
posterior expectation, is the minimum variance estimator.

Being essentially a local method of non-linear optimization, the EM algorithm
needs good starting values in order to converge to the right local maximum. It may
be started on either step. If it is started on the E step, an initial pose estimate is
required. When started on the M step, an initial set of weights is needed.
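The two steps can be illustrated with a deliberately reduced example. The sketch below is not the implementation used in the experiments: it assumes scalar image features, a scalar pose acting as a scale factor (so each $M_j$ is a single number), a common standard deviation `sigma`, a constant background probability `b`, an image of width `width`, and no pose prior. All names and data values are hypothetical. The iteration is started on the E step from an initial pose.

```python
import math


def gaussian(r, sigma):
    """Normal density with standard deviation sigma, evaluated at residual r."""
    return math.exp(-0.5 * (r / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def e_step(beta, ys, ms, sigma, b, width):
    """Re-evaluate the weights W_ij (scalar analog of Equation 8.5)."""
    c = (b / (1.0 - b)) * (len(ms) / width)
    weights = []
    for y in ys:
        g = [gaussian(y - m * beta, sigma) for m in ms]
        denom = c + sum(g)
        weights.append([gj / denom for gj in g])
    return weights


def m_step(weights, ys, ms):
    """Hold the weights fixed and solve the (here scalar) linear equation for the pose.

    With a common sigma and no pose prior, sigma cancels from the solution.
    """
    num = sum(w * m * y for row, y in zip(weights, ys) for w, m in zip(row, ms))
    den = sum(w * m * m for row in weights for w, m in zip(row, ms))
    return num / den


def objective(beta, ys, ms, sigma, b, width):
    """Sum over image features of the log mixture density given the pose."""
    total = 0.0
    for y in ys:
        matched = sum(gaussian(y - m * beta, sigma) for m in ms)
        total += math.log(b / width + (1.0 - b) / len(ms) * matched)
    return total


def em_refine(beta, ys, ms, sigma, b, width, iterations=20):
    """Alternate E and M steps, recording the objective after each iteration."""
    history = [objective(beta, ys, ms, sigma, b, width)]
    weights = []
    for _ in range(iterations):
        weights = e_step(beta, ys, ms, sigma, b, width)
        beta = m_step(weights, ys, ms)
        history.append(objective(beta, ys, ms, sigma, b, width))
    return beta, weights, history


model = [1.0, 2.0, 3.0, 4.0]            # object features (the scalars M_j)
image = [2.05, 3.9, 6.1, 8.0, 13.0]     # four noisy matches at scale 2, one background feature
beta, weights, history = em_refine(1.9, image, model, sigma=0.2, b=0.2, width=20.0)
print("refined pose:", beta)
```

In this toy run the weights for the background feature stay near zero, the weights for the four matched pairs approach one, and the recorded objective values do not decrease from one iteration to the next.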

An initial set of weights can be obtained from a partial hypothesis of
correspondences in a simple manner. The weights associated with each set of corresponding
features in the hypothesis are set to 1, the rest to 0. Indexing methods are one source
of such hypotheses. In Chapter 10, Angle Pair Indexing is used to generate candidate
hypotheses. In this scenario, indexing provides initial alignments, these are refined
using the EM algorithm, then they are verified by examining the value of the peak of
the PMPE objective function that the refinement step found.


8.2  Convergence 

In the original reference [21], the EM algorithm was shown to have good convergence
properties under fairly general circumstances. It is shown that the likelihood sequence
produced by the algorithm is monotonic, i.e., the algorithm never reduces the value
of the objective function (or in this case, the posterior probability) from one step to
the next. Wu [77] claims that the convergence proof in the original EM reference is
flawed, and provides another proof, as well as a thorough discussion. It is possible
that the algorithm will wander along a ridge, or become stuck in a saddle point.

In  the  recognition  experiments  reported  in  Chapter  10  the  algorithm  typically 
converges  in  10  -  40  iterations. 


8.3  Implementation  Issues 

Some thresholding methods were used to speed up the computation of the E and M
steps.

The weights $W_{ij}$ provide a measure of feature correspondence. As the algorithm
operates, most of the weights have values close to zero, since most pairs of image and
object features don't align well for a given pose. In the computation of the M step,
most terms were left out of the sum, based on a threshold for $W_{ij}$. Some representative
weights from an experiment are displayed in Table 10.1 in Chapter 10.

In the E step, most of the work is in evaluating the Gaussian functions, which have
quadratic forms in them. For the reason stated above, most of these expressions have
values very close to zero. The evaluation of these expressions was made conditional
on a threshold test applied to the residuals $Y_i - M_j \beta$. When the (x, y) part of the
residual exceeded a certain length, zero was substituted for the value of the Gaussian
expression. Tables indexed by image coordinates might provide another effective way
of implementing the thresholding here.

The value of the PMPE objective function is computed as a byproduct of the E
step for little additional cost.


8.4  Related  Work 

The  work  of  Green  [31]  and  Green  and  Shapiro  [32]  that  is  discussed  in  Section  7.7 
describes  use  of  the  EM  algorithm  in  a  theory  of  laser  radar  range  profiling. 

Lipson  [50]  describes  a  non-statistical  method  for  refining  alignments  that  iterates 
solving  linear  systems.  It  matches  model  features  to  the  nearest  image  feature  under 
the  current  pose  hypothesis,  while  the  method  described  here  entertains  matches  to 
all  of  the  image  features,  weighted  by  their  probability.  Lipson's  method  was  shown 
to  be  effective  and  robust  in  an  implementation  that  refines  alignments  under  Linear 
Combination  of  Views. 


Chapter  9 


Angle  Pair  Indexing 


9.1  Description  of  Method 

Angle Pair Indexing is a simple method that is designed to reduce the amount of
search needed in finding matches for image features in 2D recognition. It uses features
having location and orientation.

An invariant property of feature pairs is used to index a table that is constructed
ahead of time. The property used is the pair of angles between the feature orientations
and the line joining the features' locations. These angles are $\theta_1$ and $\theta_2$ in Figure 9-1.
The pair of angles is clearly invariant under translation, rotation, and scaling in the
plane.

Using orientations as well as point locations provides more constraint than point
features. Because of this, indexing may be performed on pairs of simple features,
rather than groups of three or more.

The  table  is  constructed  from  the  object  features  in  a  pre-processing  step.  It  is 
indexed  by  the  angle  pair,  and  stores  the  pairs  of  object  features  that  are  consistent 
with  the  value  of  the  angles,  within  the  resolution  of  the  table.  The  algorithm  for 
constructing  the  table  appears  below. 

A distance threshold is used to suppress entries for features that are very close.
Such feature pairs yield sloppy initial pose estimates and are poor initial hypotheses
for recognition.

Figure 9-1: Angles for Indexing

;;; Given an array model-features and a table size, n,
;;; fills in the 2-index array ANGLE-PAIR-TABLE by side-effect.
BUILD-ANGLE-TABLE(model-features, n, distance-threshold)
    m ← LENGTH(model-features)
    ;; First clear the table.
    For i ← 0 To n
        For j ← 0 To n
            ANGLE-PAIR-TABLE[i, j] ← ∅
    ;; Now fill in the table entries.
    For i ← 0 To m
        For j ← 0 To m
            f1 ← model-features[i]
            f2 ← model-features[j]
            If DISTANCE(f1, f2) > distance-threshold
                <q r> ← CALCULATE-INDICES(f1, f2, n)
                ANGLE-PAIR-TABLE[q, r] ← ANGLE-PAIR-TABLE[q, r] ∪ <i j>

The following function is used to calculate the table indices for a pair of features.
Note that the indexing wraps around when the angles are increased by π. This
was done because the features used in the recognition experiments described in this
research are often straight edge segments, and their orientations are ambiguous by π.

;;; Calculate indices into ANGLE-PAIR-TABLE for a pair of features.
;;; θ1 and θ2 are the angles of Figure 9-1 for f1 and f2.
CALCULATE-INDICES(f1, f2, n)
    i ← ⌊θ1 / (π/n)⌋ mod n
    j ← ⌊θ2 / (π/n)⌋ mod n
    return(<i j>)

The  following  algorithm  is  used  at  recognition-time  to  generate  a  set  of  pairs  of 
correspondences  from  image  features  to  object  features  that  have  consistent  values  of 
the angle pair invariant. The indexing operation saves the expense of searching for
pairs of object model features that are consistent with pairs of image features. Table
entries from adjacent cells are included among the candidates to accommodate angle
values that are "on the edge" of a cell boundary.

;;; Map over the pairs of features in an image and generate
;;; candidate pairs of feature correspondences.
GENERATE-CANDIDATES(image-features, n)
  candidates <- ∅
  m <- LENGTH(image-features)
  For i <- 0 To m
    For j <- i + 1 To m
      <q r> <- CALCULATE-INDICES(image-features[i], image-features[j], n)
      For δq <- -1 To 1
        For δr <- -1 To 1
          For <k l> ∈ ANGLE-PAIR-TABLE[(q + δq) mod n, (r + δr) mod n]
            candidates <- candidates ∪ < <i k> <j l> >
  Return(candidates)
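The recognition-time lookup, including the search of adjacent cells with wrap-around, might be sketched as follows. The table is assumed to be a dictionary from `(q, r)` cells to lists of model index pairs, and the index function is passed in as a parameter; both choices are assumptions of this sketch.

```python
from itertools import combinations


def generate_candidates(image_features, table, n, calculate_indices):
    """Collect candidate pairs of correspondences ((i, k), (j, l)).

    i, j index image features; k, l index model features stored in the table.
    """
    candidates = set()
    for (i, fi), (j, fj) in combinations(enumerate(image_features), 2):
        q, r = calculate_indices(fi, fj, n)
        # Visit the indexed cell and its eight neighbours, wrapping modulo n.
        for dq in (-1, 0, 1):
            for dr in (-1, 0, 1):
                cell = ((q + dq) % n, (r + dr) % n)
                for k, l in table.get(cell, ()):
                    candidates.add(((i, k), (j, l)))
    return candidates
```

Collecting the candidates in a set means that a table entry reached through more than one neighbouring cell (possible only for very small n) is reported once.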


9.2  Sparsification 

In the recognition experiments described below and in Section 10.1, an additional
technique was used to speed up recognition-time processing, and reduce the size of
the table. As the table was built, a substantial fraction of the entries were left out
of the table. These entries were selected at random. This strategy is based on the
following observation: For the purpose of recognizing the object, it is only necessary
for some feature pair from the object to be both in the table and visible in the image. If
a reasonable fraction of the object is visible, a substantial number of feature pairs will
be available as potential partners in a candidate correspondence pair. It is unlikely
that the corresponding pairs of object model features will all have been randomly
eliminated when the table was built, even for fairly large amounts of sparsification.
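The observation above can be made concrete with a small calculation. The sketch below assumes that each table entry is kept independently with the same probability, which is a simplifying assumption; the function name and the example numbers are illustrative only.

```python
def survival_probability(keep_fraction, visible_pairs):
    """Probability that at least one of the visible feature pairs is still in the table.

    Assumes each entry was kept independently with probability keep_fraction.
    """
    return 1.0 - (1.0 - keep_fraction) ** visible_pairs
```

For example, if only one entry in twenty is kept and 100 object feature pairs are visible, `survival_probability(0.05, 100)` is above 0.99, so the object still has a high chance of being indexed.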


9.3  Related  Work 

Indexing based on invariant properties of sets of image features has been used by
Lamdan and Wolfson, in their work on geometric hashing [49], and by Clemens and
Jacobs [19][20], Jacobs [45], and Thompson and Mundy [70]. In those cases the
invariance is with respect to affine transformations that have [...] parameters. In
this work the invariance is with respect to translation, rotation, and scale in 2D,
where there are four parameters. Thompson and Mundy describe an invariant called
vertex pairs. These are based on angles relating to pairs of vertices of 3D polyhedra,
and their projections into 2D. Angle Pair Indexing is somewhat similar, but is simpler,
being designed for 2D from 2D recognition.

Clemens and Jacobs [19][20], and Jacobs [45] use grouping mechanisms to select
small sets of image features that are likely to belong to the same object in the scene.
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Chapter  10 

Recognition  Experiments 

This  chapter  describes  several  recognition  experiments  that  use  Posterior  Marginal 
Pose  Estimation  with  the  EM  Algorithm.  The  first  is  a  complete  2D  recognition 
system  that  uses  Angle  Pair  Indexing  as  the  first  stage.  In  another  experiment,  the 
PMPE objective function is evaluated on numerous random alignments. Additionally,
the effect of occlusions on PMPE is investigated. Finally, refinement of 3D
alignments is demonstrated.

In  the  following  experiments,  image  edge  curves  were  arbitrarily  subdivided  into 
fragments  for  feature  extraction.  The  recognition  experiments  based  on  these  features 
show  good  performance,  but  the  performance  might  be  improved  if  a  more  stable 
subdivision  technique  were  used. 

10.1  2D  Recognition  Experiments 

The  experiments  described  in  this  section  use  the  EM  algorithm  to  carry  out  local 
searches  in  pose  space  of  the  PMPE  objective  function.  This  is  used  for  evaluating 
and  refining  alignments  that  are  generated  by  Angle  Pair  Indexing.  A  coarse  -  fine 
approach  is  used  in  refining  the  alignments  produced  by  Angle  Pair  Indexing.  To  this 
end,  two  sets  of  features  are  used,  coarse  features  and  fine  features. 
Figure 10-1: Grayscale Image
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Figure  10-2:  Coarse  Model  and  Image  Features 



The video image used for the recognition experiment appears in Figure 10-1. The
model features were derived from Mean Edge Images, as described in Section 4.4.
The standard deviation of the smoothing that was used in preparing the model and
image edge maps was 3.97 for the coarse features, and 1.93 for the fine features. The
edge curves were broken arbitrarily every 20 pixels for the coarse features, and every
10 pixels for the fine features. Point-radius features were fitted to the edge curve
fragments, as described in Section 5.3. The coarse model and image features appear
in Figure 10-2; the fine model and image features appear in Figure 10-3. There are 81
coarse model features, 334 coarse image features, 246 fine model features, and 1063
fine image features.

The  oriented  stationary  statistics  model  of  feature  fluctuations  was  used  (this 
is  described  in  Section  3.3).  The  parameters  (statistics)  that  appear  in  the  PMPE 
objective  function,  the  background  probability  and  the  covariance  matrix  for  the 
oriented  stationary  statistics,  were  derived  from  matches  that  were  done  by  hand. 
These  training  matches  were  also  used  in  the  empirical  study  of  the  goodness  of 
the  normal  model  for  feature  fluctuations  discussed  in  Section  3.2.1,  and  they  are 
described  there. 

10.1.1  Generating  Alignments 

Initial  alignments  were  generated  using  Angle  Pair  Indexing  (described  in  Chapter  9) 
on  the  coarse  features.  The  angle  pair  table  was  constructed  with  80  by  80  cells,  and 
sparsification  was  used  -  b  percent  of  the  entries  were  randomly  kept.  The  distance 
threshold  was  set  at  50  pixels  (the  image  size  is  640  by  480).  The  resulting  table 
contained  234  entries.  With  these  values,  uniformly  generated  random  angle  pairs 
have  .0365  probability  of  “hitting”  in  the  table. 

When the image feature pairs were indexed into the table, 20574 candidate feature
correspondence pairs were generated. This is considerably fewer than the 732 million
possible pairs of correspondences in this situation. Figure 10-4 illustrates three of
the candidate alignments by superimposing the object in the images at the pose
associated with the initial alignment implied by the pairs of feature correspondences.
The indicated scores are the negative of the PMPE objective function computed with
the coarse features.
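The counts quoted in this section can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that are not stated in the text: that the 732 million figure counts ordered pairs of (image feature, model feature) correspondences, and that each of the 234 table entries occupies a distinct cell.

```python
def indexing_counts(image_features=334, model_features=81, table_side=80,
                    table_entries=234, candidates=20574):
    """Return (possible correspondence pairs, table occupancy, reduction factor)."""
    # Assumed counting: every ordered pair of single correspondences.
    possible = (image_features * model_features) ** 2
    # Fraction of cells holding an entry, assuming no two entries share a cell.
    occupancy = table_entries / table_side ** 2
    reduction = possible / candidates
    return possible, occupancy, reduction
```

Under these assumptions the number of possible pairs is 731,918,916, in line with the 732 million quoted above, and the table occupancy is 234 / 6400, close to the quoted hit probability of .0365.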

10.1.2  Scoring  Indexer  Alignments 

The  initial  alignments  were  evaluated  in  the  following  way.  The  indexing  process 
produces  hypotheses  consisting  of  a  pair  of  correspondences  from  image  features  to 
object  features.  These  pairs  of  correspondences  were  converted  into  an  initial  weight 
matrix  for  the  EM  algorithm.  The  M  step  of  the  algorithm  was  run,  producing  a 
rough alignment pose. The pose was then evaluated using the E step of the EM
algorithm,  which  computes  the  value  of  the  objective  function  as  a  side  effect  (in 
addition  to  a  new  estimate  of  the  weights).  Thus,  running  one  cycle  of  the  EM 
algorithm,  initialized  by  the  pair  of  correspondences,  generates  a  rough  alignment, 
and  evaluates  the  PMPE  objective  function  for  that  alignment. 
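The cycle described above (weights from a pair of correspondences, an M step giving a pose, an E step giving new weights and a score) can be sketched in a simplified setting. The code below is an illustration under strong assumptions and is not the PMPE formulation of this work: features are plain 2D points written as complex numbers, the pose is a similarity transform `(c, t)` acting as `c * z + t`, the feature noise is isotropic Gaussian with standard deviation `sigma` in place of the oriented stationary statistics, no pose prior is used, and the score `sum(log(1 + sum_j g_ij / background))` is a stand-in for the objective function. All names are hypothetical.

```python
import math


def m_step(image_pts, model_pts, weights):
    """Weighted least-squares similarity pose (c, t) with image ~ c * model + t."""
    pairs = [(w, image_pts[i], model_pts[j])
             for i, row in enumerate(weights) for j, w in enumerate(row) if w > 0.0]
    total = sum(w for w, _, _ in pairs)
    z_bar = sum(w * z for w, _, z in pairs) / total  # weighted model centroid
    y_bar = sum(w * y for w, y, _ in pairs) / total  # weighted image centroid
    num = sum(w * (z - z_bar).conjugate() * (y - y_bar) for w, y, z in pairs)
    den = sum(w * abs(z - z_bar) ** 2 for w, _, z in pairs)
    c = num / den          # complex number encoding scale and rotation
    t = y_bar - c * z_bar  # translation
    return c, t


def e_step(image_pts, model_pts, pose, sigma, background):
    """Return the new weight matrix and the (simplified) score of the pose."""
    c, t = pose
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    weights, score = [], 0.0
    for y in image_pts:
        g = [norm * math.exp(-abs(y - (c * z + t)) ** 2 / (2.0 * sigma ** 2))
             for z in model_pts]
        denom = background + sum(g)
        weights.append([gj / denom for gj in g])
        score += math.log(denom / background)
    return weights, score


def refine(image_pts, model_pts, init_pairs, sigma, background, cycles=15):
    """Run EM cycles starting from a set of (image index, model index) pairs."""
    weights = [[1.0 if (i, j) in init_pairs else 0.0
                for j in range(len(model_pts))] for i in range(len(image_pts))]
    for _ in range(cycles):
        pose = m_step(image_pts, model_pts, weights)
        weights, score = e_step(image_pts, model_pts, pose, sigma, background)
    return pose, score, weights
```

Calling `refine` with `cycles=1` corresponds to the scoring of an indexer hypothesis described above; a larger number of cycles corresponds to refinement. A fixed cycle count is used in place of a convergence test to keep the sketch short.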

10.1.3  Refining  Indexer  Alignments 

This  section  illustrates  the  method  used  to  refine  indexer  alignments. 

Figure 10-5 shows a closer view of the best scoring initial alignment from Angle
Pair Indexing. The initial alignment was refined by running the EM algorithm to
convergence using the coarse features and statistics. The result of this coarse refinement
is  displayed  in  Figure  10-6.  The  coarse  refinement  was  refined  further  by  running  the 
EM  algorithm  to  convergence  with  the  fine  features  and  statistics.  The  result  of  this 
fine  refinement  is  shown  in  Figure  10-7,  and  over  the  video  image  in  Figure  10-8. 

Figure 10-4: Poses and Scores of Some Indexed Hypotheses
Figure 10-9: Correspondences with Weight Larger than .5

Ground truth for the pose is available in this experiment, as the true pose is the
null pose. The pose before refinement is

[.99595, -0.0084747, -0.37902, 5.0048]^T,

and after the refinement it is

[1.00166, 0.0051108, 0.68621, -1.7817]^T.

The encoding of these poses is described in Section 5.3 (the null pose is [1, 0, 0, 0]^T).
The  initial  pose  is  in  error  by  about  .01  in  scale  and  5  pixels  in  position.  The  final 
pose  errs  by  about  .005  in  scale  and  1.8  pixels  in  position.  Thus  scale  accuracy  is 
improved by a factor of about two, and position accuracy is improved by a factor of
about  three.  An  experiment  showing  more  dramatic  improvement  is  described  below, 
in  Section  10.4.1. 
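The error figures in this paragraph can be reproduced approximately from the two pose vectors. The sketch below assumes that the first two pose components carry scale and rotation and the last two carry translation, and it measures each error as a Euclidean distance from the null pose; this reading is an assumption that happens to match the quoted magnitudes, not a definition taken from Section 5.3.

```python
import math

NULL_POSE = (1.0, 0.0, 0.0, 0.0)


def pose_errors(pose, truth=NULL_POSE):
    """Return (scale/rotation error, position error) of a 4-component pose."""
    scale_err = math.hypot(pose[0] - truth[0], pose[1] - truth[1])
    position_err = math.hypot(pose[2] - truth[2], pose[3] - truth[3])
    return scale_err, position_err


initial = (0.99595, -0.0084747, -0.37902, 5.0048)
final = (1.00166, 0.0051108, 0.68621, -1.7817)
```

With these definitions, `pose_errors(initial)` gives roughly .009 and 5.0 pixels, and `pose_errors(final)` gives roughly .005 and 1.9 pixels. The quoted position error of 1.8 pixels is closer to the magnitude of the last component alone.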

In these experiments, fewer than 15 iterations of the EM algorithm were needed for
convergence. 


10.1.4  Final  EM  Weights 
As discussed in Section 8.1, a nice aspect of using the EM algorithm with PMPE is
that estimates of feature correspondences are available in the weight matrix. Figure
10-9 displays the correspondences that have weight greater than .5, for the final
convergence shown in Figure 10-7. Here, the image and model features are displayed
as thin curves, and the correspondences between them are shown as heavy lines
joining the features. Note the strong similarity between these correspondences, and
those that the system was trained on, shown in Figure 3-2.

Table 10.1 displays the values of some of the weights. The weights shown have
values greater than .01. There are 299 weights this large among the 413,507 weights.
The 59 weights shown are those belonging to 20 image features.


10.2  Evaluating  Random  Alignments 

An experiment was performed to test the utility of PMPE in evaluating randomly
generated alignments. Correspondences among the coarse features described in
Section 10.1 having assignments from two image features to two model features were
randomly generated, and evaluated as in Section 10.1.2. 19118 random alignments
were  generated,  of  which  1200  had  coarse  scores  better  than  -50.0  (the  negative  of 
the  PMPE  objective  function).  Among  these  1200,  one  was  essentially  correct.  The 
first,  second,  third,  fourth,  fifth,  and  fifteenth  best  scoring  alignments  are  shown  in 
Figure  10-10. 
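Drawing one such random hypothesis might look like the following sketch. The text does not say how the random correspondences were drawn; sampling two distinct image features and two distinct model features uniformly is an assumption made here, and the function name is hypothetical.

```python
import random


def random_correspondence_pair(num_image, num_model, rng):
    """Draw two distinct image features and two distinct model features.

    Returns ((i, k), (j, l)): image feature i is assigned to model feature k,
    and image feature j to model feature l.
    """
    i, j = rng.sample(range(num_image), 2)
    k, l = rng.sample(range(num_model), 2)
    return (i, k), (j, l)
```

For the coarse features of Section 10.1 the call would be `random_correspondence_pair(334, 81, random.Random(0))`, with each result then scored by one cycle of the EM algorithm as in Section 10.1.2.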

With  coarse  -  fine  refinement,  the  best  scoring  random  alignment  converged  to 
the  same  pose  as  the  best  refinement  from  the  indexing  experiment,  shown  in  Figure 
10-7, with fine score -555.069. The next best scoring random alignment converged to
a  grossly  wrong  pose,  with  fine  score  -149.064.  This  score  provides  some  indication 
of  the  noise  level  in  the  fine  PMPE  objective  function  in  pose  space. 

This test, though not exhaustive, produced no false positives, in the sense of a bad
alignment with a coarse score better than that of the correct alignment. Additionally,
the fine score of the refinement of the most promising incorrect random alignment
was significantly lower than the fine score of the (correct) refined best alignment.

Image Index    Model Index    Weight
90             86             0.022738026840027032
90             101            0.014615921646994348
90             102            0.80796669.3444096
90             103            0.09.581539482455806
91             103            0.963.3441301926663
92             85             0.24166197059125494
92             103            0.19778274847425015
93             87             0.02784697957543993
93             88             0..37419218245379466
94             87             0.7478667723520142
95             87             0.44030413275215486
96             86             0.6127902576993082
97             85             0.9293665165549775
98             85             0.8621763443868999
99             84             0.9634827438267516
100            5              0.6499527214931725
100            84             0.19705210016850308
101            0              0.011400725934573982
101            67             0.9559675939354566
102            66             0.9194110795990801
102            67             0.0541643593533511
103            64             0.04765362703894284
103            65             0.8454128520499249
103            66             0.05787873660955701
104            63             0.05270908685541295
104            64             0.8854088356653954
104            65             0.014744194821866506
105            62             0.06158503222464117
105            63             0.9139939556525918
106            61             0.09270729594689026
106            62             0.8635739185353283
106            63             0.010447389024937672
107            61             0.9108899984969661
107            62             0.021204649868405194
108            60             0.861831671427887
108            61             0.049220125250993084
109            58             0.018077232316743887
109            59             0.925731118.3042934
109            60             0.01.5434004217119.308

Table 10.1: Some EM Weights

Figure 10-10: Random Alignments


10.3  Convergence  with  Occlusion 

The convergence behavior under occlusion of the EM algorithm with PMPE was
evaluated using the coarse features described in Section 10.1. Image features simulating
varying amounts of occlusion were prepared by shifting a varying portion of the image.
These images, along with results of coarse - fine refinement using the EM algorithm,
are shown in Figure 10-11.

The  starting  value  for  the  pose  was  the  correct  (null)  pose.  The  refinements 
converged  to  good  poses  in  all  cases,  demonstrating  that  the  method  can  converge 
from  good  alignments  with  moderate  amounts  of  occlusion. 

The  final  fine  score  in  the  most  occluded  example  is  lower  than  the  noise  level 
observed  in  the  experiment  of  Section  10.2.  This  indicates  that  as  the  amount  of 
occlusion  increases,  a  point  will  be  reached  where  the  method  will  fail  to  produce  a 
good pose having a score above the noise level. In this experiment it happens before
the  method  fails  to  converge  properly. 

10.4  3D  Recognition  Experiments 

10.4.1  Refining  3D  Alignments 

This section demonstrates use of the EM algorithm with PMPE to refine alignments
in 3D recognition. The linear combination of views method is used to accommodate
a limited amount of out of plane rotation. A two-view variant of LCV, described in
Section 5.4, is used.

Figure 10-11: Fine Convergences with Occlusion
Figure 10-12: Grayscale Image

A coarse - fine approach was used. Coarse PMPE scores were computed by
smoothing the PMPE objective function, as described in Section 7.3.2. The smoothing
matrix was

DIAG((7.07)^2, (3.0)^2).

These  numbers  are  the  amounts  of  additional  (artificial)  variance  added  for  parallel 
and  perpendicular  deviations,  respectively,  in  the  oriented  stationary  statistics  model. 
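As a small illustration of the smoothing just described, the sketch below adds the two artificial variances to a 2 by 2 feature covariance matrix. It assumes the covariance is expressed in the (parallel, perpendicular) frame of the oriented stationary statistics model; the function name and the nested-list representation are choices of this sketch.

```python
def smoothed_covariance(cov, parallel_sd=7.07, perpendicular_sd=3.0):
    """Add DIAG(parallel_sd^2, perpendicular_sd^2) to a 2x2 covariance matrix.

    cov is [[var_parallel, cov_pp], [cov_pp, var_perpendicular]].
    """
    (a, b), (c, d) = cov
    return [[a + parallel_sd ** 2, b],
            [c, d + perpendicular_sd ** 2]]
```

With the values quoted above, about 50 square pixels of variance are added along the feature and 9 square pixels across it, which broadens the peaks of the objective function for the coarse stage.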

The  video  test  image  is  shown  in  Figure  10-12.  It  differs  from  the  model  images 
by  a  significant  3D  translation  and  out  of  plane  rotation.  The  test  image  edges  are 
shown  in  Figure  10-13. 

The  object  model  was  derived  from  the  two  Mean  Edge  Images  shown  in  Figure 
10-14.  These  were  constructed  as  described  in  Section  4.4. 

The smoothing used in preparation of the edge maps had 1.93 pixels standard
deviation, and the edge curves were broken arbitrarily every 10 pixels. Point-radius
features were fitted to the edge curve fragments, as described in Section 5.3, for
purposes of display and for computing the oriented stationary statistics, although the
features used with PMPE and the EM algorithm were simply the X and Y coordinates
of the centroids of the curve fragments. Both views of the model features are shown
in Figure 10-15. The linear combination of views method requires correspondences
among the model views. These were established by hand, and are displayed in Figure
10-16.

Figure 10-13: Image Edges

The  relationship  among  the  viewpoints  in  the  model  images  and  the  test  image  is 
illustrated  in  Figure  10-17.  This  represents  the  region  of  the  view  sphere  containing 
the viewpoints. Note that the test image is not on the line joining the two model
views. 

The  oriented  stationary  statistics  model  of  feature  fluctuations  was  used  (this  is 
described  in  Section  3.3).  As  in  Section  10.1,  the  parameters  (statistics)  that  appear  in 
the  PMPE  objective  function,  the  background  probability  and  the  covariance  matrix 
for  the  oriented  stationary  statistics,  were  derived  from  matches  done  by  hand. 

A set of four correspondences was established manually from the image features
to the object features. These correspondences are intended to simulate an alignment
generated by an indexing system. Indexing systems that are suitable for 3D recognition
are described by Clemens and Jacobs [19] and Jacobs [45]. The rough alignment
and score were obtained from the correspondences by one cycle of the EM algorithm,
as described above in Section 10.1.2. They are displayed in Figure 10-18, where the
four corresponding features appear circled. A coarse alignment was then obtained
by running the EM algorithm to convergence with smoothing; the result appears in
Figure 10-19. This alignment was refined further by running the EM algorithm again,
without smoothing. The resulting alignment and score are shown in Figure 10-20. In
these figures, the image features are shown as curve fragments for clarity, although
only the point locations were used in the experiment. The image features used are a
subset taken from a rectangular region of the larger image.

Figure 10-16: Model Correspondences
Figure 10-17: Model and Test Image View Points
Figure 10-18: Initial Alignment
Figure 10-19: Coarse Refined Alignment and Coarse Score
Figure 10-20: Fine Refined Alignment and Fine Score
Figure 10-21: Fine Refined Alignment with Video Image

Figure 10-21 displays the final alignment superimposed over the original video
image. Most of the model features have aligned well. The discrepancy in the forward
wheel well may be due to inaccuracies in the LCV modeling process, perhaps in the
feature correspondences. This figure demonstrates good results for aligning a smooth
3D object having six degrees of freedom of motion, without the use of privileged features.

10.4.2  Refining  Perturbed  Poses 

This  section  describes  an  additional  demonstration  of  local  search  in  pose  space  using 
PMPE  in  3D. 

The  pose  corresponding  to  the  refined  alignment  displayed  in  Figure  10-20  was 
perturbed  by  adding  a  displacement  by  4.0  pixels  in  Y.  This  pose  was  then  refined 
by running the EM algorithm to convergence. The perturbed alignment and the
resulting coarse - fine refinement are shown in Figure 10-22. The result is very close
to the pose prior to perturbation.

A similar experiment was carried out with a larger perturbation, 12.0 pixels in
Y. The results of this appear in Figure 10-23. This time the convergence is to
a clearly wrong alignment. The model has been stretched to a thin configuration,
and mismatched to the image. The resulting fine score is lower than that of the
good alignment in Figure 10-21. This illustrates a potential drawback of the linear
combination of views method. In addition to correct views, LCV can synthesize
views where the model is stretched. LCV, as used here, has 8 parameters, rather
than the 6 of rigid motion. The two extra parameters determine the stretching part
of the transformation. This problem can be addressed by checking, or enforcing, a
quadratic constraint on the parameters. This is discussed in [71].

Another similar experiment was performed starting with a very bad alignment.
The results appear in Figure 10-24. The algorithm was able to bring some features
into alignment, but the score remained low.

Figure 10-23: Perturbed Alignment and Resulting Refinement with Fine Score
Figure 10-24: Bad Alignment and Resulting Refinement with Fine Score
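The stretching behaviour noted above can be illustrated with a small sketch. It assumes one commonly used two-view form of linear combination of views, in which each synthesized coordinate is a linear combination of the x and y coordinates in the first model view, the x coordinate in the second model view, and a constant, giving 8 coefficients in total. The variant described in Section 5.4 may differ in detail, and the function and parameter names are hypothetical.

```python
def lcv_two_view(view1, view2_x, a, b):
    """Synthesize 2D points from two model views with 8 coefficients.

    view1:   list of (x1, y1) feature coordinates in the first model view
    view2_x: list of x2 coordinates of the same features in the second view
    a, b:    4-tuples of coefficients for the synthesized x and y coordinates
    """
    return [(a[0] * x1 + a[1] * y1 + a[2] * x2 + a[3],
             b[0] * x1 + b[1] * y1 + b[2] * x2 + b[3])
            for (x1, y1), x2 in zip(view1, view2_x)]
```

With `a = (1, 0, 0, 0)` and `b = (0, 1, 0, 0)` the first model view is reproduced exactly. Changing `b` to `(0, 0.1, 0, 0)` is an equally valid setting of the 8 parameters, yet it squashes the model to a tenth of its height, a view that no rigid motion of the object could produce. Excluding such settings is the role of the constraint on the parameters mentioned above.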
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Conclusions 


Visual  object  recognition  -  finding  a  known  object  in  scenes,  where  the  object  is 
smooth,  is  viewed  under  varying  illumination  conditions,  has  six  degrees  of  freedom 
of  position,  is  subject  to  occlusions  and  appears  against  varying  backgrounds  -  still 
presents  problems.  In  this  thesis,  progress  has  been  made  by  applying  methods  of 
statistical  inference  to  recognition.  Ever-present  uncertainties  are  accommodated 
by  statistical  characterizations  of  the  recognition  problem:  MAP  Model  Matching 
(MMM) and Posterior Marginal Pose Estimation (PMPE). MMM was shown to be
effective for searching among feature correspondences and PMPE was shown effective
for  searches  in  pose  space.  The  issue  of  acquiring  salient  object  features  under  varying 
illumination  was  addressed  by  using  Mean  Edge  Images. 

The alignment approach, which leverages fast indexing methods of hypothesis
generation, is utilized. Angle Pair Indexing is introduced as an efficient 2D indexing
method that does not depend on extended or special features that can be hard to
detect. An extension to the alignment approach that may be summarized as align
refine verify is advocated. The EM algorithm is employed for refining the estimate of
the  object's  pose  while  simultaneously  identifying  and  incorporating  the  constraints 
of  all  supporting  image  features. 

Areas for future research include the following:

•  Indexing was not used in the 3D recognition experiments. Identifying a suitable
mechanism for this purpose that meshes well with the type of features used here
would be an improvement.

•  Too few views were used in model construction. Fully automating the model
acquisition process, as described in Chapter 4, and acquiring models from more
views would help.

•  Extending the formulations of recognition to handle multiple objects is
straightforward, but identifying suitable search strategies is an important and
non-trivial task.

•  Incorporating non-linear models of projection into the formulation would allow
robust performance in domains having serious perspective distortions.

•  Using image-like tables could speed the evaluation of the PMPE objective
function.

•  Investigating the use of PMPE in object tracking or in other active vision
domains might prove fruitful.

More work in these areas will lead to practical and robust object recognition
systems.


Appendix  A 


Notation 


Symbol                      Meaning                                   Defining Section

Y = {Y_1, Y_2, ..., Y_n}    the image                                 2.1
n                           number of image features
Y_i ∈ [...]                 image feature                             2.1
M = {M_1, M_2, ..., M_m}    the object model                          2.1
m                           number of object features
M_j                         model feature, frequently M_j ∈ [...]     2.1
⊥                           the background feature                    2.1
Γ = {Γ_1, Γ_2, ..., Γ_n}    correspondences                           2.1
Γ_i ∈ M ∪ {⊥}               assignment of image feature i             2.1
β ∈ R^[...]                 pose of object                            5.1
[...]                       projection into image                     5.1
G_ψ(x)                      Gaussian probability density              3.2, 6.1
[...]                       covariance matrix of feature pair         3.3
ψ                           stationary feature covariance matrix      3.3
[...]                       covariance matrix of pose prior           6.1
[...]                       background probability                    2.2, 2.4
W_k                         extent of image feature dimension k       3.1
λ_ij, λ                     correspondence reward                     6.1
x̂                           estimate of x
p(·)                        probability                               (see below)

Probability notation is somewhat abused in this work, in the interest of brevity.
p(x) may stand for either a probability mass function of a discrete variable x, or for a
probability density function of a continuous variable x. The meaning will be clear in
context based on the type of the variable argument. Additionally, mixed probabilities
are described with the same notation. For example, p(Γ, β | Y) stands for the mixed
probability function that is a probability mass function of Γ (the discrete variable
describing correspondences), and a probability density function of β (the pose vector),
both conditioned on Y (the image feature coordinates).
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